library(rvest)
## Loading required package: xml2
library(NLP)
library(openNLP)
library(ggmap)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(rworldmap)
## Loading required package: sp
## ### Welcome to rworldmap ###
## For a short introduction type : vignette('rworldmap')
library(rworldxtra)
library(openNLPmodels.en)
This R Markdown document performs basic NLP and named entity extraction for the Fortune 500 company Johnson & Johnson and plots the results. The data comes from the company's Wikipedia page: https://en.wikipedia.org/wiki/Johnson_%26_Johnson.
page = read_html('https://en.wikipedia.org/wiki/Johnson_%26_Johnson')
The paragraph nodes of the Johnson & Johnson Wikipedia page are selected with html_nodes, and their text is extracted with html_text, which leaves out the images. Empty paragraphs are then dropped and the bracketed reference markers (for example [12]) are removed, so that only the text remains.
text = html_text(html_nodes(page,'p'))
text = text[text != ""]
text = gsub("\\[[0-9]]|\\[[0-9][0-9]]|\\[[0-9][0-9][0-9]]","",text)
text[1:4]
## [1] "Johnson & Johnson is an American multinational medical devices, pharmaceutical and consumer packaged goods manufacturer founded in 1886. Its common stock is a component of the Dow Jones Industrial Average and the company is listed among the Fortune 500."
## [2] "Johnson & Johnson is headquartered in New Brunswick, New Jersey, directly adjacent to the campus of Rutgers University, the consumer division being located in Skillman, New Jersey. The corporation includes some 250 subsidiary companies with operations in 60 countries and products sold in over 175 countries. Johnson & Johnson had worldwide sales of $70.1 billion during calendar year 2015."
## [3] "Johnson & Johnson's brands include numerous household names of medications and first aid supplies. Among its well-known consumer products are the Band-Aid Brand line of bandages, Tylenol medications, Johnson's baby products, Neutrogena skin and beauty products, Clean & Clear facial wash and Acuvue contact lenses."
## [4] "Johnson & Johnson operates over 250 companies in what is termed \"the Johnson & Johnson family of companies\". The company operates in three broad divisions; Consumer Healthcare, Medical Devices and Pharmaceuticals."
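The cleaning step can be checked on a small sample before it is applied to the whole page. This is a minimal sketch, assuming a made-up vector `raw` that stands in for the scraped paragraphs; it uses a compact quantified pattern that matches the same one- to three-digit reference markers.

```r
# Hypothetical stand-in for the scraped paragraphs (not real page content)
raw <- c("Founded in 1886.[1] Listed in the Fortune 500.[12]",
         "",
         "Sales grew.[123]")

# Drop empty paragraphs, as in the step above
paras <- raw[raw != ""]

# Remove bracketed reference markers with one to three digits
cleaned <- gsub("\\[[0-9]{1,3}\\]", "", paras)
print(cleaned)
```

Markers with four or more digits, or non-numeric ones such as [a], would be left in place by this pattern.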
The paragraphs are then collapsed into a single string, separated by spaces, so that the text forms one document on which the annotations can be performed.
text = paste(text,collapse = " ")
text = as.String(text)
text
## Johnson & Johnson is an American multinational medical devices, pharmaceutical and consumer packaged goods manufacturer founded in 1886. Its common stock is a component of the Dow Jones Industrial Average and the company is listed among the Fortune 500. Johnson & Johnson is headquartered in New Brunswick, New Jersey, directly adjacent to the campus of Rutgers University, the consumer division being located in Skillman, New Jersey. The corporation includes some 250 subsidiary companies with operations in 60 countries and products sold in over 175 countries. Johnson & Johnson had worldwide sales of $70.1 billion during calendar year 2015. Johnson & Johnson's brands include numerous household names of medications and first aid supplies. Among its well-known consumer products are the Band-Aid Brand line of bandages, Tylenol medications, Johnson's baby products, Neutrogena skin and beauty products, Clean & Clear facial wash and Acuvue contact lenses. Johnson & Johnson operates over 250 companies in what is termed "the Johnson & Johnson family of companies". The company operates in three broad divisions; Consumer Healthcare, Medical Devices and Pharmaceuticals. Inspired by a speech by antiseptic advocate Joseph Lister, Robert Wood Johnson joined his brothers James Wood Johnson and Edward Mead Johnson to create a line of ready-to-use surgical dressings in 1885. The company produced its first products in 1886 and incorporated in 1887. Robert Wood Johnson served as the first president of the company. He worked to improve sanitation practices in the nineteenth century, and lent his name to a hospital in New Brunswick, New Jersey. Upon his death in 1910, he was succeeded in the presidency by his brother James Wood Johnson until 1932, and then by his son, Robert Wood Johnson II. Robert Wood Johnson's granddaughter, Mary Lea Johnson Richards, was the first baby to appear on a Johnson & Johnson baby powder label. 
His great-grandson, Jamie Johnson, made a documentary called Born Rich about the experience of growing up as the heir to one of the world's greatest fortunes. McNeil Consumer Healthcare was founded on March 16, 1879 by 23-year-old Robert McNeil. In 1904, one of McNeil's sons, Robert Lincoln McNeil, became part of the company and together they created McNeil Laboratories in 1933. The company focused on direct marketing of prescription drugs to hospitals, pharmacists, and doctors. Development of acetaminophen began under the leadership of Robert L. McNeil, Jr., who later served as the firm's chairman. In 1959, Johnson & Johnson acquired McNeil Laboratories and a year later the company was able to sell Tylenol for the first time ever, without a prescription. In 1977, two subsidiary companies were created; McNeil medicals products and McNeil Consumer Products Company (also known as McNeil Consumer Healthcare). The focus of McNeil medicals products is to market prescription drugs. In 1993 McNeil medicals products merged with the Ortho Pharmaceutical to form Ortho-McNeil Pharmaceutical. In 2001 McNeil Consumer Healthcare changed its name to McNeil Consumer & Specialty medicals products. However, it was later changed to "McNeil Consumer Healthcare". The company markets over-the-counter and prescription medicals products including complete lines of Tylenol and Motrin IB (ibuprofen) products for adults and children. In 1933, Swiss chemist Bernhard Joos set up a small research laboratory in Schaffhausen, Switzerland. This set the basis for the founding of Chemische Industrie-Labor AG (Chemical Industry Laboratory AG or Cilag) on 12 May 1936. In 1959, Cilag joined the Johnson & Johnson family of companies. In the early nineties the marketing organizations of Cilag and Janssen Pharmaceutica were joined to form Janssen-Cilag. The non-marketing activities of both companies still operate under their original name. 
Cilag continues to have operations under the Cilag name in Switzerland, ranging from research and development through manufacturing and international services. In August 2014 Cilag acquired Covagen a biopharmaceutical company which specialises in the development of multi-specific protein based therapeutics. As part of the acquisition Cilag wll gain access to Covagen’s lead drug candidate, COVA 322, a bi-specific anti-tumor necrosis factor (TNF)-alpha/anti-interleukin (IL)-17A FynomAb, is in a Phase Ib study for psoriasis. Janssen Pharmaceuticals can be traced back to 1933. In 1933 Constant Janssen, the father of Paul Janssen, acquired the right to distribute the pharmaceutical products of Richter, a Hungarian pharmaceutical company, for Belgium, the Netherlands and Belgian Congo. On 23 October 1934, he founded the N.V. Produkten Richter in Turnhout. After the Second World War, the name for the company products was changed to Eupharma, although the company name Richter would remain until 1956. Paul Janssen founded his own research laboratory in 1953 on the third floor of the building in the Statiestraat, still within the Richter-Eurpharma company of his father. On 5 April 1956, the name of the company was changed to NV Laboratoria Pharmaceutica C. Janssen (named after Constant Janssen). On 2 May 1958, the research department in Beerse became a separate legal entity, the N.V. Research Laboratorium C. Janssen. On 24 October 1961, the company was acquired by the American corporation Johnson & Johnson. On 10 February 1964, the name was changed to Janssen Pharmaceutica N.V. Between 1990 and 2004, Janssen Pharmaceuticals expanded worldwide, with the company grew in size to approximately 28000 employees worldwide. In 1999, clinical research and non-clinical development become a global organization within Johnson & Johnson. 
In 2001, part of the research activities was transferred to the United States with the reorganization of research activities in the Johnson & Johnson Pharmaceutical Research and Development organization. The research activities of the Janssen Research Foundation and the R.W. Johnson Pharmaceutical Research Institute were merged into the new global research organization. On 27 October 2004, the Paul Janssen Research Center, for discovery research, was inaugurated. In August 2013, the company acquired Aragon Pharmaceuticals, Inc.. In November 2014, the company acquired Alios BioPharma, Inc. for $1.75 billion. As a result of the purchase, Alios was incorporated into the infectious diseases therapeutic area of the Janssen Pharmaceutical Companies of Johnson & Johnson. In March 2015, Janssen licensed Tipofarnib (a farnesyl transferase inhibitor) to Kura Oncology who will assume sole responsibility for developing and commercialising the anti-cancer drug. Later in the same month the company announced that Galapagos Pharma and regained the rights to the anti-inflammatory drug candidate GLPG1690 as well as two other compounds including GLPG1205 (a first-in-class inhibitor of GPR84). Finally, in March, the company acquired XO1 Limited In November 2015, the company acquired Novira Therapeutics, Inc., gaining the lead candidate, NVR 3-778. DePuy was acquired by J&J in 1998, rolling it into the Johnson & Johnson Medical Devices group. On June 14, 2012, Johnson and Johnson completed the acquisition of Synthes for $19.7 billion, which was then integrated with the DePuy franchise to establish the DePuy Synthes Companies of Johnson & Johnson which includes; Codman & Shurteff, Inc., DePuy Mitek, Inc., DePuy Orthopaedics, Inc. and DePuy Spine, Inc. Janssen Biotech, Inc., formerly known as Centocor Biotech, Inc., is a biotechnology company that was founded in Philadelphia in 1979. In 1982 Centocor transitioned into a publicly traded company. 
In 1999, Centocor became a wholly owned subsidiary of Johnson & Johnson. Since the acquisition, Janssen Biotech increased its annual sales from $500 million to more than $2 billion. During the same period, research and development investment increased from $75 million to more than $300 million. In 2008, Centocor, Inc. and Ortho Biotech Inc. merged to form Centocor Ortho Biotech Inc. In June 2010 Centocor Ortho Biotech acquired RespiVert, a privately held drug discovery company focused on developing small-molecule, inhaled therapies for the treatment of pulmonary diseases. In June 2011 Centocor Ortho Biotech changed its name to Janssen Biotech, Inc. as part of a global effort to unite the Janssen Pharmaceutical Companies around the world under a common identity. In December 2014, the company announced it would co-develop MacroGenics cancer drug candidate (MGD011) which targets both CD19 and CD3 proteins in treating B-cell malignant tumours. This could net MacroGenics up to $700 million. In January 2015, the company announced it will utilise Isis Pharmaceuticals' RNA-targeting technology to discover and develop antisense drugs targeting autoimmune disorders of the gastrointestinal tract, with the partnership potentially generating up to $835 million for Isis. In 1915, George F. Merson opened a facility in Edinburgh for the manufacturing, packaging and sterilising of catgut, silk and nylon sutures. Johnson & Johnson acquired Mr. Merson’s company in 1947, and this was renamed Ethicon Suture Laboratories. In 1953 this became Ethicon Inc. In 1992, Ethicon was restructured, and Ethicon Endo-Surgery, Inc. became a separate corporate entity. During the 1990s, Ethicon diversified into new and advanced products and technologies and formed four different companies under the Ethicon umbrella, each of which specialize in different products. In 2008 J&J announced it would acquire Mentor Corporation for $1 billion and merge its operations into Ethicon. 
In March 2016, J&Js Ethicon business unit announced it would acquire NeuWave Medical, Inc. Ethicon Endo-Surgery was part of Ethicon Inc. until 1992, when it became a separate corporate entity under the J&J umbrella. In October 2010 J&J acquired Crucell for $2.4 billion. The following is an illustration of the company's structure, maintained though a number of mergers & acquisitions (this is not a comprehensive list): Baby Care Skin & Hair Care Wound Care and Topicals Oral Health Care Women’s Health Over-The-Counter Medicines Nutritionals Advanced Sterilization Products Animas Corporation Biosense Webster Codman & Shurteff, Inc. DePuy Mitek, Inc. DePuy Orthopaedics, Inc. DePuy Spine, Inc Mentor Acclarent NeuWave Medical, Inc Janssen Diagnostics BVBA Johnson & Johnson Vision Care, Inc. LifeScan, Inc. Ethicon Endo-Surgery Covagen Janssen-Cilag Aragon Pharmaceuticals, Inc. Alios BioPharma, Inc. Novira Therapeutics, Inc. Janssen R&D LLC Janssen Healthcare Innovation Janssen Biotech, Inc. Ortho Biotech Inc. RespiVert Janssen Therapeutics Janssen Diagnostics Janssen Scientific Affairs Crucell Janssen-Ortho McNeil Nutritional LLC Mead Johnson(Sold to Johnson & Johnson) Current members of the board of directors of Johnson & Johnson are: Mary Sue Coleman, James G. Cullen, Dominic Caruso, Michael M.E. Johns, Ann Dibble Jordan, Arnold G. Langbo, Susan L. Lindquist, Leo F. Mullin, William Perez, Steven S. Reinemund, David Satcher, and William C. Weldon. Sandi Peterson has served as Group Worldwide Chairman since 2012. On top of Alex Gorsky and Sandi Peterson, current members of Executive Committees of Johnson & Johnson are: Dominic Caruso, Peter Fasolo, Paul Stoffels, and Michael Sneed. The company has historically been located on the Delaware and Raritan Canal in New Brunswick. 
The company considered moving its headquarters out of New Brunswick in the 1960s, but decided to stay in the town after city officials promised to gentrify downtown New Brunswick by demolishing old buildings and constructing new ones. While New Brunswick lost at least one historic edifice (the inn where Rutgers University began) to the redevelopment, the gentrification did attract people back to New Brunswick. Johnson & Johnson hired Henry N. Cobb from Pei Cobb Freed & Partners to design an addition to its headquarters. The white tower in a park across the railroad tracks from the older portion of the headquarters in one of tallest buildings in New Brunswick. The stretch of Delaware and Raritan canal by the company's headquarters was replaced by a stretch of Route 18 in the late 1970s, after a lengthy dispute. In 2002, the company released its plan of setting up Asia-Pacific information technology headquarters in New South Wales within five years. The company's business is divided into three major segments, Pharmaceuticals, Medical Devices, and Consumer Products. In 2015, these segments contributed 44.9%, 35.9%, and 19.3%, respectively, of the company's total revenues. The company's major franchises in the Pharmaceutical segment include Immunology, Neuroscience, Infectious Disease, and Oncology. Immunology products include the anti-tumor necrosis factor antibodies Remicade (infliximab), and Simponi (golimumab) used for the treatment of autoimmune diseases, including rheumatoid arthritis, Crohn's disease (Remicade only), ulcerative colitis, ankylosing spondylitis, and other disorders. In 2013, these two products accounted for 29% of Johnson and Johnson's pharmaceutical revenues, and 11.3% of the company's total revenues. A third immunology product, Stelara (ustekinumab), targets interleukin-12 and interleukin-23 and is used for the treatment of psoriasis. 
Key infectious diseases products include Incivio (telaprevir), a hepatitis C protease inhibitor; Intelence (Etravirine), a non-nucleoside HIV polymerase inhibitor; and Prezista (darunavir), an HIV protease inhibitor. Telaprevir sales are expected to decline due to the recent approval of treatment regimens that are more efficacious and much better tolerated. Etravirine and darunavir are notable for their high barriers to resistance development. Darunavir in combination with HIV polymerase inhibitors is recommended as a first line treatment option for treatment naive persons with HIV infection but etravirine is approved only for use in treatment-experienced patients, owing in part to its requirement for twice-daily dosing. The company's CNS products include the ADHD drug Concerta (methylphenidate extended release), and the long-acting injectable antipsychotics Invega Sustenna (paliperidone palmitate) and Risperdal Consta (risperidone). Invega Sustenna and Risperdal Consta were the first widely utilized long-acting depot injections for the treatment of schizophrenia. Designed to address the issue of poor patient compliance with oral therapy, they are administered by intramuscular injection at intervals of 2 weeks and one month, respectively. Only minimal improvements in outcomes relative to the oral versions of these drugs were observed in the clinical trial setting, but some evidence suggests that the advantages of long-acting injections in clinical practice may be greater than is readily demonstrated in the environment of a clinical trial. Oncology products include Velcade (bortezomib), for the treatment of multiple myeloma and mantle cell lymphoma and Zytiga (abiraterone), an androgen antagonist for the treatment of prostate cancer. In clinical trials, abiraterone treatment was associated with a 4.6 to 5.2 survival advantage when used either before or after chemotherapy with platinum based drugs. 
On December 31, 2012, the Food and Drug Administration approved Sirturo (bedaquiline), a Johnson & Johnson tuberculosis drug that is the first new medicine to fight the infection in more than forty years. Sectors in which the company is active include: Sectors in which the company is active include: Johnson & Johnson has set several positive goals to keep the company environmentally friendly and was ranked third among the United States's largest companies in Newsweek's "Green Rankings". Some examples are the reduction in water use, waste, and energy use and an increased level of transparency. Johnson & Johnson agreed to change its packaging of plastic bottles used in the manufacturing process, switching their packaging of liquids to non-polycarbonate containers. The corporation is working with the Climate Northwest Initiative and the EPA National Environmental Performance Track program. As a member of the national Green Power Partnership, Johnson & Johnson operates the largest solar power generator in Pennsylvania at its site in Spring House, PA. On September 29, 1982, a "Tylenol scare" began when the first of seven individuals died in metropolitan Chicago, after ingesting Extra Strength Tylenol that had been deliberately laced with cyanide. Within a week, the company pulled 31 million bottles of capsules back from retailers, making it one of the first major recalls in American history. The incident led to reforms in the packaging of over-the-counter substances and to federal anti-tampering laws. The case remains unsolved and no suspects have been charged. Johnson & Johnson's quick response, including a nationwide recall, was widely praised by public relations experts and the media and was the gold standard for corporate crisis management. On April 30, 2010, McNeil Consumer Healthcare, a subsidiary of Johnson and Johnson, voluntarily recalled 43 over-the-counter children's medicines, including Tylenol, Tylenol Plus, Motrin, Zyrtec and Benadryl. 
The recall was conducted after a routine inspection at a manufacturing facility in Fort Washington, Pennsylvania, United States revealed that some "products may not fully meet the required manufacturing specifications". Affected products may contain a "higher concentration of active ingredients" or exhibit other manufacturing defects. Products shipped to Canada, Dominican Republic, Guam, Guatemala, Jamaica, Puerto Rico, Panama, Trinidad and Tobago, the United Arab Emirates, Kuwait and Fiji were included in the recall. In a statement, Johnson & Johnson said "a comprehensive quality assessment across its manufacturing operations" was underway. A dedicated website was established by the company listing affected products and other consumer information. On August 2009, 2010, DePuy, a subsidiary of American giant Johnson & Johnson, recalled its ASR (articular surface replacement) hip prostheses from the market. DePuy said the recall was due to unpublished National Joint Registry data showing a 12% revision rate for resurfacing at five years and an ASR XL revision rate of 13%. All hip prostheses fail in some patients, but it is expected that the rate will be about 1% a year. Pathologically, the failing prosthesis had several effects. Metal debris from wear of the implant led to a reaction that destroyed the soft tissues surrounding the joint, leaving some patients with long term disability. Ions of cobalt and chromium<U+0097>the metals from which the implant was made<U+0097>were also released into the blood and cerebral spinal fluid in some patients. In March 2013, a jury in Los Angeles ordered Johnson & Johnson to pay more than $8.3 million in damages to a Montana man in the first of more than 10,000 lawsuits pending against the company in connection with the now-recalled DePuy hip. Some lawyers and industry analysts have estimated that the suits ultimately will cost Johnson & Johnson billions of dollars to resolve. 
In 2010 and 2011, Johnson & Johnson voluntarily recalled some over-the-counter products including Tylenol due to an odor caused by tribromoanisole. In this case, 2,4,6-tribromophenol was used to treat wooden pallets on which product packaging materials were transported and stored. In 2010 a group of shareholders sued the board for allegedly failing to take action to prevent serious failings and illegalities since the 1990s, including manufacturing problems, bribing officials, covering up adverse effects and misleading marketing for unapproved uses. The judge initially dismissed the case in September 2011, but allowed the plaintiffs opportunity to refile at a later time. In 2012 Johnson and Johnson proposed a settlement with the shareholders, whereby the company would institute new oversight, quality and compliance procedures binding for five years. Juries in several US states have found J&J guilty of concealing the adverse effects of Janssen Pharmaceuticals' antipsychotic medication Risperdal, produced by its unit, in order to promote it to doctors and patients as better than cheaper generics, and of falsely marketing it for treating patients with dementia. States that have awarded damages include Texas ($158 million), South Carolina ($327 million), Louisiana ($258 million), and most notably Arkansas ($1.2 billion). In 2010, the United States Department of Justice joined a whistleblowers suit accusing the company of illegally marketing Risperdal through Omnicare, the largest company supplying pharmaceuticals to nursing homes. The allegations include that J&J were warned by the U.S. Food and Drug Administration (FDA) not to promote Risperdal as effective and safe for elderly patients, but they did so, and that they paid Omnicare to promote the drug to care home physicians. 
The settlement was finalized on November 4, 2013, with J&J agreeing to pay a penalty of around $2.2 billion, "including criminal fines and forfeiture totaling $485 million and civil settlements with the federal government and states totaling $1.72 billion." Johnson & Johnson has also been subject to congressional investigations related to payments given to psychiatrists to promote its products and ghost write articles, notably Joseph Biederman and his pediatric bipolar disorder research unit. In 2011 J&J settled litigation brought by the US Securities and Exchange Commission under the Foreign Corrupt Practices Act; J&J paid around $70M in disgorgement and fines. J&J's employees had given kickbacks and bribes to doctors in Greece, Poland, Romania to obtain business selling drugs and medical devices, and had bribed officials in Iraq to win contracts under the Oil for Food program. J&J fully cooperated with the investigation once the problems came to light. Johnson & Johnson registered the Red Cross as a U.S. trademark for "medicinal and surgical plasters" in 1905 and has used the design since 1887. The Geneva Conventions, which reserved the Red Cross emblem for specific uses, were first approved in 1864 and ratified by the United States in 1882; however, the emblem was not protected by U.S. law for the use of the American Red Cross and the U.S. military until after Johnson & Johnson had obtained its trademark. A clause in this law (now 18 U.S.C. 706) permits this pre-existing uses of the Red Cross to continue. A declaration made by the U.S. upon its ratification of the 1949 Geneva Conventions includes a reservation that pre-1905 U.S. domestic uses of the Red Cross, such as Johnson & Johnson's, would remain lawful as long as the cross is not used on "aircraft, vessels, vehicles, buildings or other structures, or upon the ground," i.e. uses which could be confused with its military uses. This means that the U.S. 
did not agree to any interpretation of the 1949 Geneva Conventions that would overrule Johnson & Johnson's trademark. The American Red Cross continues to recognize the validity of Johnson & Johnson's trademark. In August 2007, Johnson & Johnson filed a lawsuit against the American Red Cross (ARC), demanding that the charity halt the use of the red cross symbol on products it sells to the public, though the company takes no issue with the charity's use of the mark for non-profit purposes. In May 2008, the judge in the case dismissed most of Johnson & Johnson's claims, and a month later the two organizations announced a settlement had been reached in which both parties would continue to use the symbol. Since 2003, Johnson & Johnson and Boston Scientific have both claimed that the other had infringed on their patents covering heart stent medical devices. The litigation was settled when Boston Scientific agreed to pay $716 million to Johnson & Johnson in September 2009 and an additional $1.73 billion in February 2010. Their dispute was renewed in 2014, now on the grounds of a contract dispute. In 2007, Johnson & Johnson sued Abbott Laboratories over the development and sale of the arthritis drug Humira. Johnson & Johnson claimed that Abbott used technology patented by New York University and licensed exclusively to Johnson & Johnson's Centocor division to develop Humira. Johnson & Johnson won the court case, and in 2009 Abbott was ordered to pay Johnson & Johnson $1.17 billion in lost revenues and $504 million in royalties. The judge also added $175.6 million in interest to bring the total to $1.84 billion. This was the largest patent-infringement award in U.S. history until the 2013 decision against Teva in favor of Takeda and Pfizer for over 2.1 billion dollars. Abbott has since successfully reversed the verdict at appeal. 
In February 2016, J&J was ordered to pay $72 million in damages to the family of Jackie Fox, a 62-year-old woman who died of ovarian cancer in 2015. She had used Johnson's Baby Powder for many years. J&J claimed that the safety of cosmetic talc is supported by decades of scientific evidence and it plans to appeal the verdict. The British charity, Ovacome was quoted as saying that while there were 16 studies which showed that using talc increased the risk of ovarian cancer by around a third, and a 2013 review of US studies had similar results for genital, but not general, talcum powder use they were not convinced that the results were reliable. Furthermore, Ovacome said, "Ovarian cancer is a rare disease, and increasing a small risk by a third still gives a small risk." Coordinates: 40°29'55″N 74°26'37″W / 40.49861°N 74.44361°W / 40.49861; -74.44361
The underlying Java functions are invoked to annotate the text into sentences and words. Sentence annotation and word annotation are performed on the merged text. Entity annotators for locations and persons are then applied to identify the places and people mentioned in the data.
t = Sys.time()
sent_annot = Maxent_Sent_Token_Annotator()
word_annot = Maxent_Word_Token_Annotator()
loc_annot = Maxent_Entity_Annotator(kind = "location")
person_annot = Maxent_Entity_Annotator(kind = "person")
annot.l1 = NLP::annotate(text, list(sent_annot,word_annot,loc_annot,person_annot))
class(annot.l1) # view info & structure in this object
## [1] "Annotation" "Span"
head(annot.l1) # view top few rows
## id type start end features
## 1 sentence 1 136 constituents=<<integer,20>>
## 2 sentence 138 253 constituents=<<integer,22>>
## 3 sentence 255 434 constituents=<<integer,32>>
## 4 sentence 436 562 constituents=<<integer,20>>
## 5 sentence 564 644 constituents=<<integer,15>>
## 6 sentence 646 743 constituents=<<integer,16>>
text_doc <- AnnotatedPlainTextDocument(text, annot.l1)
head(sents(text_doc), 2) # alternatively, using the pipe operator: sents(text_doc) %>% head(2)
## [[1]]
## [1] "Johnson" "&" "Johnson" "is"
## [5] "an" "American" "multinational" "medical"
## [9] "devices" "," "pharmaceutical" "and"
## [13] "consumer" "packaged" "goods" "manufacturer"
## [17] "founded" "in" "1886" "."
##
## [[2]]
## [1] "Its" "common" "stock" "is" "a"
## [6] "component" "of" "the" "Dow" "Jones"
## [11] "Industrial" "Average" "and" "the" "company"
## [16] "is" "listed" "among" "the" "Fortune"
## [21] "500" "."
head(words(text_doc), 10) # alternatively: words(text_doc) %>% head(10)
## [1] "Johnson" "&" "Johnson" "is"
## [5] "an" "American" "multinational" "medical"
## [9] "devices" ","
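Each annotation is a span of character offsets into the text, so subsetting the String object with an annotation returns the matching substrings. The following is a minimal sketch of that mechanism, assuming a hand-built Annotation over a made-up two-sentence string instead of the OpenNLP models; the string and the offsets are illustrative only.

```r
library(NLP)

# Made-up text; the analysis above uses the collapsed Wikipedia text instead
s <- as.String("Cilag was founded. It joined in 1959.")

# Hand-written spans: two sentences and the first word (offsets are 1-based, inclusive)
a <- Annotation(
  id    = 1:3,
  type  = c("sentence", "sentence", "word"),
  start = c(1L, 20L, 1L),
  end   = c(18L, 37L, 5L)
)

# Subsetting the String with a set of spans returns the matching substrings
sentences  <- s[a[a$type == "sentence"]]
first_word <- s[a[a$type == "word"]]
print(sentences)
print(first_word)
```

The annotators in the pipeline produce the same kind of object, only with offsets computed by the models.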
Locations and names of persons are proper nouns. Using the ‘openNLP’ entity annotations and the ‘sapply’ function, the different names and locations found in the data are extracted and displayed.
# The home-brewed function below extracts the different annotation kinds
# Extract entities from an AnnotatedPlainTextDocument
entities <- function(doc, kind) {
  s <- doc$content
  a <- annotations(doc)[[1]]
  if(hasArg(kind)) {
    k <- sapply(a$features, # in the annotated doc's features column
                `[[`, # for every subsetted element
                "kind") # ID the kind of annotation present
    s[a[k == kind]]
  } else {
    s[a[a$type == "entity"]]
  }
} # func ends
# now apply the entities() func we just defined to text_doc
johnson_persons = entities(text_doc, kind = "person")
johnson_locations = entities(text_doc, kind = "location")
# Identify the unique locations and persons in the data
all_places = unique(johnson_locations) # view contents of this obj
all_names = unique(johnson_persons) # view contents of this obj
# Display the persons and places found in the data, then report the elapsed run time
all_places
## [1] "New Brunswick" "New Jersey" "Skillman"
## [4] "Schaffhausen" "Switzerland" "Belgium"
## [7] "the Netherlands" "Congo" "United States"
## [10] "Philadelphia" "Edinburgh" "Delaware"
## [13] "Route" "New South Wales" "Pennsylvania"
## [16] "Spring House" "Chicago" "Fort Washington"
## [19] "Canada" "Dominican Republic" "Guam"
## [22] "Guatemala" "Jamaica" "Puerto Rico"
## [25] "Panama" "Trinidad" "United Arab Emirates"
## [28] "Kuwait" "Los Angeles" "Montana"
## [31] "South Carolina" "Louisiana" "Arkansas"
## [34] "Greece" "Poland" "Romania"
## [37] "Iraq" "Boston"
all_names
## [1] "Johnson" "Joseph Lister"
## [3] "Robert Wood Johnson" "James Wood Johnson"
## [5] "Edward Mead Johnson" "James"
## [7] "Wood Johnson" "Mary Lea Johnson Richards"
## [9] "Jamie Johnson" "Rich"
## [11] "Robert Lincoln McNeil" "Robert L. McNeil"
## [13] "McNeil Consumer Healthcare" "Paul Janssen"
## [15] "Richter" "Johnson Pharmaceutical Research"
## [17] "Paul Janssen Research" "Johnson Medical"
## [19] "George" "Johnson Vision Care"
## [21] "Inc. LifeScan" "Mary Sue Coleman"
## [23] "James G. Cullen" "Dominic Caruso"
## [25] "Michael M.E." "Ann Dibble Jordan"
## [27] "Arnold" "Langbo"
## [29] "Susan L. Lindquist" "Leo F. Mullin"
## [31] "William Perez" "Steven S. Reinemund"
## [33] "David Satcher" "William C. Weldon"
## [35] "Peterson" "Alex Gorsky"
## [37] "Sandi Peterson" "Peter Fasolo"
## [39] "Paul Stoffels" "Michael Sneed"
## [41] "Rutgers University" "Henry N. Cobb"
## [43] "Risperdal" "Joseph Biederman"
## [45] "Abbott Laboratories" "Abbott"
## [47] "Jackie Fox"
Sys.time() - t
## Time difference of 10.24959 secs
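The person list above contains some entries that are not people, such as "Rutgers University" and "Abbott Laboratories". A rough sketch of one way to screen them is shown here, assuming a hand-picked keyword list (`org_words` is hypothetical and would need tuning for other pages):

```r
# A few entries copied from the all_names output above
some_names <- c("Joseph Lister", "Rutgers University", "Abbott Laboratories",
                "Alex Gorsky", "Johnson Vision Care", "Mary Sue Coleman")

# Hypothetical, hand-picked keywords that suggest an organisation
org_words <- c("University", "Laboratories", "Healthcare", "Research",
               "Medical", "Care", "Inc")

# Match any keyword as a whole word
org_pattern <- paste0("\\b(", paste(org_words, collapse = "|"), ")\\b")
looks_like_org <- grepl(org_pattern, some_names, perl = TRUE)

likely_persons <- some_names[!looks_like_org] # kept as person names
needs_review   <- some_names[looks_like_org]  # set aside for a manual check
likely_persons
needs_review
```

A keyword screen like this is only a first pass: entries such as "Risperdal" contain none of these keywords and would still need a manual check.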
Now let's geocode the places in the data and view them before plotting them on the map.
all_places_geocoded <- geocode(all_places) # use all_places[1:10] to geocode only the first ten
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=New%20Brunswick&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=New%20Jersey&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Skillman&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Schaffhausen&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Switzerland&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Belgium&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=the%20Netherlands&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Congo&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=United%20States&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Philadelphia&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Edinburgh&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Delaware&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Route&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=New%20South%20Wales&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Pennsylvania&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Spring%20House&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Chicago&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Fort%20Washington&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Canada&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Dominican%20Republic&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Guam&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Guatemala&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Jamaica&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Puerto%20Rico&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Panama&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Trinidad&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=United%20Arab%20Emirates&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Kuwait&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Los%20Angeles&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Montana&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=South%20Carolina&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Louisiana&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Arkansas&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Greece&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Poland&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Romania&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Iraq&sensor=false
## .
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Boston&sensor=false
all_places_geocoded # view contents of this obj
## lon lat
## 1 -74.451819 40.486216
## 2 -74.405661 40.058324
## 3 -74.714682 40.420116
## 4 8.638049 47.695890
## 5 8.227512 46.818188
## 6 4.469936 50.503887
## 7 5.291266 52.132633
## 8 15.827659 -0.228021
## 9 -95.712891 37.090240
## 10 -75.165222 39.952584
## 11 -3.188267 55.953252
## 12 -75.527670 38.910832
## 13 59.810477 35.721086
## 14 146.921099 -31.253218
## 15 -77.194525 41.203322
## 16 -75.227676 40.185386
## 17 -87.629798 41.878114
## 18 -77.023031 38.707338
## 19 -106.346771 56.130366
## 20 -70.162651 18.735693
## 21 144.793731 13.444304
## 22 -90.230759 15.783471
## 23 -77.297508 18.109581
## 24 -66.590149 18.220833
## 25 -80.782127 8.537981
## 26 -104.500541 37.169463
## 27 53.847818 23.424076
## 28 47.481766 29.311660
## 29 -118.243685 34.052234
## 30 -110.362566 46.879682
## 31 -81.163725 33.836081
## 32 -91.962333 30.984298
## 33 -91.831833 35.201050
## 34 21.824312 39.074208
## 35 19.145136 51.919438
## 36 24.966760 45.943161
## 37 43.679291 33.223191
## 38 -71.058880 42.360082
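The geocoded table is indexed only by row number, so it is hard to tell which coordinates belong to which place. A small sketch of pairing the names with their coordinates before plotting follows, using three rows copied from the outputs above (the object names `places`, `coords` and `geocoded` are illustrative, and the NA filter assumes that unresolved places come back with missing coordinates):

```r
# Three places and their coordinates, copied from the outputs above
# (rows 1, 13 and 26)
places <- c("New Brunswick", "Route", "Trinidad")
coords <- data.frame(
  lon = c(-74.451819, 59.810477, -104.500541),
  lat = c(40.486216, 35.721086, 37.169463)
)

# Attach the place names so each coordinate pair can be checked by eye
geocoded <- data.frame(place = places, coords, stringsAsFactors = FALSE)

# Drop rows without coordinates (assumed to be places that were not resolved)
geocoded <- geocoded[complete.cases(geocoded[, c("lon", "lat")]), ]
geocoded
```

Viewed this way, two rows stand out: "Route" is not a place name yet still receives coordinates, and "Trinidad" resolves to a point inside the continental United States rather than in the Caribbean. Both are worth reviewing before they are plotted.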
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.