SUBMITTED BY: G SAI SWAPNA PG ID: 71710101
The Walt Disney Company (popularly known as Disney) is a Fortune 500 company. It is a multinational mass media and entertainment conglomerate.
First, load the required packages.
rm(list=ls())
library(NLP)
library(openNLP)
library(leaflet)
## Warning: package 'leaflet' was built under R version 3.3.3
library(rvest)
## Loading required package: xml2
library(tm)
## Warning: package 'tm' was built under R version 3.3.3
library(openNLPmodels.en)
library(magrittr)
library(ggmap)
## Warning: package 'ggmap' was built under R version 3.3.3
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
##
## Attaching package: 'ggmap'
## The following object is masked from 'package:magrittr':
##
## inset
Read the Wikipedia page for Disney and extract only the paragraph text from it.
wiki_page=read_html("https://en.wikipedia.org/wiki/The_Walt_Disney_Company")
text=html_text(html_nodes(wiki_page,'p')) #extracting paragraphs
text=text[text!=""] #removing empty paragraphs
head(text,2)
## [1] "The Walt Disney Company, commonly known as Disney, is an American diversified multinational mass media and entertainment conglomerate, headquartered at the Walt Disney Studios in Burbank, California. It is the world's second largest media conglomerate in terms of revenue, after Comcast.[3] Disney was founded on October 16, 1923 <U+0096> by brothers Walt Disney and Roy O. Disney <U+0096> as the Disney Brothers Cartoon Studio, and established itself as a leader in the American animation industry before diversifying into live-action film production, television, and theme parks. The company also operated under the names The Walt Disney Studio and then Walt Disney Productions. Taking on its current name in 1986, it expanded its existing operations and also started divisions focused upon theater, radio, music, publishing, and online media."
## [2] "In addition, Disney has since created corporate divisions in order to market more mature content than is typically associated with its flagship family-oriented brands. The company is best known for the products of its film studio, Walt Disney Studios, which is today one of the largest and best-known studios in American cinema. Disney's other three main divisions are Walt Disney Parks and Resorts, Disney Media Networks, and Disney Consumer Products and Interactive Media.[4] Disney also owns and operates the ABC broadcast television network; cable television networks such as Disney Channel, ESPN, A+E Networks, and Freeform; publishing, merchandising, music, and theatre divisions; and owns and licenses 14 theme parks around the world. The company has been a component of the Dow Jones Industrial Average since May 6, 1991. Mickey Mouse, an early and well-known cartoon creation of the company, is a primary symbol and mascot for Disney."
Notice that the output above contains reference markers such as [3] and [4]. Let's remove them from the paragraphs, since they carry no information for this analysis.
text=gsub("\\[[0-9]]|\\[[0-9][0-9]]|\\[[0-9][0-9][0-9]]","",text)
str(text)
## chr [1:72] "The Walt Disney Company, commonly known as Disney, is an American diversified multinational mass media and entertainment conglo"| __truncated__ ...
text=paste(text,collapse = " ")
s=as.String(text)
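The reference-stripping pattern above lists one, two, and three digit markers as separate alternatives. As a minimal sketch, the same idea can be written with a single quantifier. The sample strings here are made up for illustration, and note that this variant would also remove markers with four or more digits, which the original pattern leaves alone.

```r
# Hypothetical sample paragraphs standing in for the scraped text
sample_text <- c("after Comcast.[3] Disney was founded",
                 "Interactive Media.[4] Disney also owns",
                 "a later claim.[123]")

# "\\[[0-9]+\\]" matches "[", one or more digits, then "]"
clean_text <- gsub("\\[[0-9]+\\]", "", sample_text)
print(clean_text)
```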
Let's find the locations and persons mentioned in the page. This requires sentence and word annotation first, followed by entity annotation.
word_annotator=Maxent_Word_Token_Annotator()
sent_annotator=Maxent_Sent_Token_Annotator()
pos_tag=Maxent_POS_Tag_Annotator()
loc_annotator=Maxent_Entity_Annotator(kind="location")
per_annotator=Maxent_Entity_Annotator(kind="person")
Determine the locations and record the time taken to annotate the document.
t=Sys.time()
loc=NLP::annotate(s,list(sent_annotator,word_annotator,loc_annotator))
T=Sys.time()
T-t
## Time difference of 14.3866 secs
l=sapply(loc$features,'[[',"kind")
location=s[loc[l=="location"]]
head(location)
## [1] "Burbank" "California" "Kansas City" "Missouri" "Hollywood"
## [6] "California"
str(unique(location))
## chr [1:35] "Burbank" "California" "Kansas City" "Missouri" ...
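The timing above takes a Sys.time() reading before and after the call and subtracts them. A possible alternative, sketched here with a placeholder workload instead of the NLP::annotate() call, is base R's system.time(), which wraps the expression and reports the elapsed seconds directly.

```r
# Placeholder workload standing in for the annotation call (an assumption,
# chosen only so the sketch runs without the NLP packages)
timing <- system.time({
  result <- sum(sqrt(seq_len(1e5)))
})

# "elapsed" is the wall-clock time in seconds
elapsed_secs <- timing[["elapsed"]]
print(elapsed_secs)
```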
Determine the person names and record the time taken to annotate the document.
t=Sys.time()
per=NLP::annotate(s,list(sent_annotator,word_annotator,per_annotator))
T=Sys.time()
T-t
## Time difference of 13.88827 secs
p=sapply(per$features,'[[',"kind")
person=s[per[p=="person"]]
head(person)
## [1] "Roy" "A+E Networks" "Alice"
## [4] "Virginia Davis" "Roy" "Margaret J. Winkler"
str(unique(person))
## chr [1:95] "Roy" "A+E Networks" "Alice" "Virginia Davis" ...
str(person)
## chr [1:142] "Roy" "A+E Networks" "Alice" "Virginia Davis" ...
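The two str() calls show 142 person mentions but only 95 unique names, so some names are repeated. One way to see which names recur most often is a frequency table. The sketch below uses a small made-up vector in place of `person`; the name `person_sample` is hypothetical.

```r
# Hypothetical stand-in for the `person` vector produced above
person_sample <- c("Roy", "Alice", "Roy", "Virginia Davis", "Roy", "Alice")

# table() counts occurrences; sort() puts the most frequent names first
person_counts <- sort(table(person_sample), decreasing = TRUE)
print(head(person_counts, 3))
```

Applied to the real `person` vector, the same two calls would rank the names by how often the annotator tagged them.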
Take the unique locations and geocode them to get a longitude and latitude for each.
unique_loc=unique(location)
t=Sys.time()
loc_geocode=geocode(unique_loc)
T=Sys.time()
loc_geocode$PLACE=unique_loc
T-t
## Time difference of 32.57875 secs
head(loc_geocode)
## lon lat PLACE
## 1 -118.30897 34.18084 Burbank
## 2 -119.41793 36.77826 California
## 3 -94.57857 39.09973 Kansas City
## 4 -91.83183 37.96425 Missouri
## 5 -118.32866 34.09281 Hollywood
## 6 -74.00594 40.71278 New York City
Geocoding the 35 unique locations took about 33 seconds in this run.
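A geocoding lookup can fail for an ambiguous or misrecognized place name, in which case geocode() returns NA coordinates for that row. As a minimal sketch, such rows can be dropped before mapping. The data frame below is made up to mimic the shape of `loc_geocode`, and the "Unresolved place" row is a hypothetical failure. Also note that recent versions of ggmap require a Google API key (set with register_google()) before geocode() will work.

```r
# Hypothetical data frame with the same columns as loc_geocode
geo_sample <- data.frame(
  lon = c(-118.30897, NA, -94.57857),
  lat = c(34.18084, NA, 39.09973),
  PLACE = c("Burbank", "Unresolved place", "Kansas City"),
  stringsAsFactors = FALSE
)

# Keep only rows where both coordinates are present
geo_ok <- geo_sample[complete.cases(geo_sample[, c("lon", "lat")]), ]
print(geo_ok)
```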
The map below shows the locations mentioned in the Wikipedia page of Disney.
leaflet(data=loc_geocode[1:nrow(loc_geocode),])%>%addTiles()%>%
addMarkers(~lon,~lat,popup=~as.character(PLACE))
Click on a marker to see the name of the location.
From the map you can observe that most of the 35 unique locations are in the United States.
The numbers in the page fall into a few groups: some are parts of dates, some are dollar amounts such as income, and some are percentages related to ownership and revenue. Start with the dollar amounts.
library(stringr)
amounts=str_extract_all(s,"\\$[0-9]+(\\,|.)?[0-9]+(\\s)?(million|trillion|billion)?")
amounts
## [[1]]
## [1] "$1,500 " "$20 million" "$42.6 million" "$45 million"
## [5] "$193 million" "$300 million" "$200 million" "$38 million"
## [9] "$100 million" "$10.4 million" "$100 " "$200 million"
## [13] "$1.2 billion" "$54 billion" "$7.4 billion" "$4.24 billion"
## [17] "$4.4 billion" "$4.06 billion" "$500 million"
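Three details of the pattern above are worth knowing. The unescaped `.` matches any character, not only a decimal point. The optional `\\s` captures a trailing space when no unit follows (as in "$1,500 "). And because the pattern needs digits on both sides of the separator, a single-digit amount such as "$5" is skipped. The following is a sketch of a stricter variant in base R, run on a made-up sentence; it is one possible tightening, not the pattern used for the output above.

```r
# Hypothetical sample sentence
sample_text <- "It cost $1,500 at first, then $42.6 million, and later $7.4 billion; a fee of $5 applied."

# "$", digits with optional ",ddd" groups, optional decimal part,
# then an optional unit preceded by a single space
pattern <- "\\$[0-9]+(,[0-9]{3})*(\\.[0-9]+)?( (million|billion|trillion))?"

amounts_sample <- regmatches(sample_text, gregexpr(pattern, sample_text))[[1]]
print(amounts_sample)
```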
Some numbers give a percentage of ownership, revenue, or income drop. Let's extract those percentages.
percent=str_extract_all(s,"[0-9]+(%|\\s(percent))")
percent
## [[1]]
## [1] "60%" "40%" "90%" "70%" "1%"
## [6] "7 percent" "20%" "28 percent" "22 percent" "23 percent"
## [11] "51 percent" "43 percent" "9 percent" "39 percent" "45%"
## [16] "1%" "7%" "1 percent"
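The matches come back as strings in two spellings ("60%" and "7 percent"). If the values are needed for arithmetic, a small sketch like the following strips the suffix and converts to numbers. The input vector is a subset of the output above; the variable names are illustrative.

```r
# A few of the matched strings from the output above
percent_sample <- c("60%", "7 percent", "28 percent", "1%")

# Remove a trailing "%" or " percent", then convert to numeric
percent_values <- as.numeric(sub("(%|[[:space:]]*percent)$", "", percent_sample))
print(percent_values)
```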
dates= str_extract_all(s,"(January|February|March|April|May|June|July|August|September|October|November|December)?(\\s)?([0-9][0-9])?(\\,)?(\\s)?[1-2][0-9][0-9][0-9]")
head(dates[[1]],8)
## [1] "October 16, 1923" " 1986" ", 1991"
## [4] " 1923" " 1923" "January 1926"
## [7] "February 1928" " 1928"
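This extracts the dates and years mentioned in the document, although some matches are partial. For example, ", 1991" comes from "May 6, 1991": the pattern only accepts a two-digit day, so the month and day are dropped, and bare years keep their leading space.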
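As a sketch of how the date pattern could be tightened, the variant below accepts one or two digit days and only allows a space or comma when a month name precedes the year. It runs on a made-up sentence and uses base R's built-in `month.name` constant. It has no word-boundary check, so a longer run of digits would still produce a four-digit match.

```r
# Hypothetical sample sentence
sample_text <- "Founded on October 16, 1923, renamed in 1986, listed since May 6, 1991, with a short released in February 1928."

# Optional "Month " followed by an optional "d, " or "dd, ", then a year
months <- paste(month.name, collapse = "|")
pattern <- paste0("((", months, ") ([0-9]{1,2}, )?)?[12][0-9]{3}")

dates_sample <- regmatches(sample_text, gregexpr(pattern, sample_text))[[1]]
print(dates_sample)
```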
Let's extract the numbers associated with employees. The second pattern also captures the two preceding words to give some context.
str_extract_all(s,"[0-9]+\\,?[0-9]+\\semployees")
## [[1]]
## [1] "550 employees" "400 employees" "4,000 employees"
str_extract_all(s,"\\w+\\s\\w+\\s[0-9]+\\,?[0-9]+\\semployees")
## [[1]]
## [1] "of its 550 employees" "laying off 400 employees"
## [3] "laying off 4,000 employees"
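To work with the employee figures as numbers, the " employees" suffix and the thousands separator have to be removed first. A minimal sketch using the three matches shown above:

```r
# The matches returned by the first employee pattern above
employee_matches <- c("550 employees", "400 employees", "4,000 employees")

# Drop the suffix, remove thousands separators, convert to numeric
employee_counts <- as.numeric(gsub(",", "", sub(" employees$", "", employee_matches)))
print(employee_counts)
```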