SUBMITTED BY: G SAI SWAPNA PG ID: 71710101

Step 1: Choose the company

The Walt Disney Company (popularly known as Disney) is a Fortune 500 company. It is a multinational mass media and entertainment conglomerate.

Now let us perform text analytics on the Wikipedia page of The Walt Disney Company and look for meaningful insights.

Step 2: Extract content from the Wikipedia page

First load the required packages.

rm(list=ls())
library(NLP)
library(openNLP)
library(leaflet)

## Warning: package 'leaflet' was built under R version 3.3.3

library(rvest)

## Loading required package: xml2

library(tm)

## Warning: package 'tm' was built under R version 3.3.3

library(openNLPmodels.en)
library(magrittr)

library(ggmap)

## Warning: package 'ggmap' was built under R version 3.3.3

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

## 
## Attaching package: 'ggmap'

## The following object is masked from 'package:magrittr':
## 
##     inset

Read the Wikipedia page for Disney and extract only its paragraphs.

wiki_page=read_html("https://en.wikipedia.org/wiki/The_Walt_Disney_Company")
text=html_text(html_nodes(wiki_page,'p'))  #extracting paragraphs
text=text[text!=""] #removing empty paragraphs
head(text,2)

## [1] "The Walt Disney Company, commonly known as Disney, is an American diversified multinational mass media and entertainment conglomerate, headquartered at the Walt Disney Studios in Burbank, California. It is the world's second largest media conglomerate in terms of revenue, after Comcast.[3] Disney was founded on October 16, 1923 – by brothers Walt Disney and Roy O. Disney – as the Disney Brothers Cartoon Studio, and established itself as a leader in the American animation industry before diversifying into live-action film production, television, and theme parks. The company also operated under the names The Walt Disney Studio and then Walt Disney Productions. Taking on its current name in 1986, it expanded its existing operations and also started divisions focused upon theater, radio, music, publishing, and online media."
## [2] "In addition, Disney has since created corporate divisions in order to market more mature content than is typically associated with its flagship family-oriented brands. The company is best known for the products of its film studio, Walt Disney Studios, which is today one of the largest and best-known studios in American cinema. Disney's other three main divisions are Walt Disney Parks and Resorts, Disney Media Networks, and Disney Consumer Products and Interactive Media.[4] Disney also owns and operates the ABC broadcast television network; cable television networks such as Disney Channel, ESPN, A+E Networks, and Freeform; publishing, merchandising, music, and theatre divisions; and owns and licenses 14 theme parks around the world. The company has been a component of the Dow Jones Industrial Average since May 6, 1991. Mickey Mouse, an early and well-known cartoon creation of the company, is a primary symbol and mascot for Disney."

Notice that the output above contains reference markers such as [3] and [4]. Let's remove them, since they are irrelevant to the analysis.

text=gsub("\\[[0-9]]|\\[[0-9][0-9]]|\\[[0-9][0-9][0-9]]","",text)
str(text)

##  chr [1:72] "The Walt Disney Company, commonly known as Disney, is an American diversified multinational mass media and entertainment conglo"| __truncated__ ...
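The three alternatives in the pattern above (one, two, or three digits) can be collapsed into a single quantifier. The following is a minimal sketch on a made-up sample string (`sample_text` is hypothetical, not taken from the page); unlike the original pattern, it also matches markers with more than three digits.

```r
# Hypothetical sample sentence with Wikipedia-style reference markers.
sample_text <- "It is the second largest media conglomerate, after Comcast.[3] See also the divisions.[4][125]"

# "\\[[0-9]+\\]" matches a literal "[", one or more digits, and a literal "]".
cleaned <- gsub("\\[[0-9]+\\]", "", sample_text)
print(cleaned)
```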

There are 72 paragraphs. Let's concatenate them into a single document.

text=paste(text,collapse = " ")
s=as.String(text)

Step 3: Extract locations and persons

Let's find the locations and persons mentioned in the page. Entity annotation requires sentence and word annotation first, so we create annotators for all of them.

word_annotator=Maxent_Word_Token_Annotator()
sent_annotator=Maxent_Sent_Token_Annotator()
pos_tag=Maxent_POS_Tag_Annotator()
loc_annotator=Maxent_Entity_Annotator(kind="location")
per_annotator=Maxent_Entity_Annotator(kind="person")

Determine the locations and record the time taken to annotate the document.

t_start=Sys.time()
loc=NLP::annotate(s,list(sent_annotator,word_annotator,loc_annotator))
t_end=Sys.time()
t_end-t_start

## Time difference of 14.3866 secs
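An alternative to subtracting two Sys.time() values is base R's system.time(), which times a single expression. This is only a sketch: `slow_step` is a hypothetical stand-in for a slow call such as NLP::annotate(), used so that the example runs without the NLP models.

```r
# Hypothetical stand-in for a slow step such as NLP::annotate().
slow_step <- function() {
  Sys.sleep(0.2)
  "done"
}

# system.time() evaluates the expression and returns user, system and elapsed times.
# The assignment inside the call still creates `result` in the calling environment.
timing <- system.time(result <- slow_step())
print(timing[["elapsed"]])
```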

l=sapply(loc$features,'[[',"kind")
location=s[loc[l=="location"]]
head(location)

## [1] "Burbank"     "California"  "Kansas City" "Missouri"    "Hollywood"  
## [6] "California"

str(unique(location))

##  chr [1:35] "Burbank" "California" "Kansas City" "Missouri" ...

Location entity annotation took about 14 seconds on this document.

There are 35 unique locations mentioned in the document.

Determine the person names and record the time taken to annotate the document.

t_start=Sys.time()
per=NLP::annotate(s,list(sent_annotator,word_annotator,per_annotator))
t_end=Sys.time()
t_end-t_start

## Time difference of 13.88827 secs

p=sapply(per$features,'[[',"kind")
person=s[per[p=="person"]]
head(person)

## [1] "Roy"                 "A+E Networks"        "Alice"              
## [4] "Virginia Davis"      "Roy"                 "Margaret J. Winkler"

str(unique(person))

##  chr [1:95] "Roy" "A+E Networks" "Alice" "Virginia Davis" ...

str(person)

##  chr [1:142] "Roy" "A+E Networks" "Alice" "Virginia Davis" ...

Person entity annotation also took about 14 seconds.

There are 95 unique person names and 142 mentions in total, so some names are repeated. The annotator is not perfect: the output above includes "A+E Networks", a television network, tagged as a person.
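To see which names are repeated, the mentions can be tabulated. The following is a minimal sketch; `person_sample` is a small hypothetical vector standing in for the full `person` vector, so the counts are illustrative only.

```r
# Hypothetical stand-in for the `person` vector produced above.
person_sample <- c("Roy", "Alice", "Virginia Davis", "Roy",
                   "Margaret J. Winkler", "Roy", "Alice")

# table() counts each distinct name; sort() puts the most frequent first.
name_counts <- sort(table(person_sample), decreasing = TRUE)

# Keep only the names that appear more than once.
repeated <- name_counts[name_counts > 1]
print(repeated)
```

Applied to the real data, the same two calls on `person` would list the most frequently mentioned names first.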

Step 4: Plot the locations on the map

Take the unique locations and geocode them.

unique_loc=unique(location)
t_start=Sys.time()
loc_geocode=geocode(unique_loc)
t_end=Sys.time()
loc_geocode$PLACE=unique_loc

t_end-t_start

## Time difference of 32.57875 secs

head(loc_geocode)

##          lon      lat         PLACE
## 1 -118.30897 34.18084       Burbank
## 2 -119.41793 36.77826    California
## 3  -94.57857 39.09973   Kansas City
## 4  -91.83183 37.96425      Missouri
## 5 -118.32866 34.09281     Hollywood
## 6  -74.00594 40.71278 New York City

Geocoding the 35 locations took about 33 seconds.
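geocode() may return NA coordinates for a place name it cannot resolve, and such rows cannot be placed on a map. The following is a minimal sketch of separating resolved from unresolved places before plotting; `geo_sample` and the place "Neverland" are hypothetical and only mimic the shape of `loc_geocode`.

```r
# Hypothetical geocoding result; the NA row stands for a place that could not be resolved.
geo_sample <- data.frame(
  lon   = c(-118.30897, NA, -94.57857),
  lat   = c(34.18084, NA, 39.09973),
  PLACE = c("Burbank", "Neverland", "Kansas City"),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE only for rows with no missing value in the chosen columns.
has_coords <- complete.cases(geo_sample[, c("lon", "lat")])
resolved   <- geo_sample[has_coords, ]
unresolved <- geo_sample$PLACE[!has_coords]

print(resolved)
print(unresolved)
```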

The map below shows the locations mentioned in the Wikipedia page of Disney.

leaflet(data=loc_geocode)%>%addTiles()%>%
  addMarkers(~lon,~lat,popup=~as.character(PLACE))

Click on a marker to see the name of the location.

The map shows that most of the 35 unique locations are in the United States.

Step 5: Extract all numbers

The numbers in the document include parts of dates, dollar amounts, and percentages related to ownership and revenue.

library(stringr)

Extract all the amounts.

amounts=str_extract_all(s,"\\$[0-9]+(,|\\.)?[0-9]+(\\s)?(million|trillion|billion)?")
amounts

## [[1]]
##  [1] "$1,500 "       "$20 million"   "$42.6 million" "$45 million"  
##  [5] "$193 million"  "$300 million"  "$200 million"  "$38 million"  
##  [9] "$100 million"  "$10.4 million" "$100 "         "$200 million" 
## [13] "$1.2 billion"  "$54 billion"   "$7.4 billion"  "$4.24 billion"
## [17] "$4.4 billion"  "$4.06 billion" "$500 million"
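Two of the matches above ("$1,500 " and "$100 ") carry a trailing space, because the pattern allows a space even when no scale word follows. The following is a minimal sketch of a tighter pattern and of converting the matches to numbers. It uses base R's regmatches() in place of str_extract_all(), and `sample_text` and `to_dollars` are hypothetical names introduced for illustration.

```r
# Hypothetical sample sentence; the amounts are taken from the output above.
sample_text <- "Costs were $1,500 at first, then $42.6 million, and later $7.4 billion."

# "$", digits, optional groups of "," or "." followed by more digits,
# then an optional scale word separated by exactly one space.
pattern <- "\\$[0-9]+([,.][0-9]+)*( (million|billion|trillion))?"
amounts <- regmatches(sample_text, gregexpr(pattern, sample_text))[[1]]
print(amounts)

# Convert each match to a plain number of dollars.
to_dollars <- function(x) {
  # Drop the scale word, then the "$" sign and thousands separators.
  value <- as.numeric(gsub("[$,]", "", sub(" .*$", "", x)))
  scale <- ifelse(grepl("trillion", x), 1e12,
           ifelse(grepl("billion", x), 1e9,
           ifelse(grepl("million", x), 1e6, 1)))
  value * scale
}
print(to_dollars(amounts))
```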

Extract percents

Some numbers express a percentage of ownership, revenue, or income change. Let's extract those numbers.

percent=str_extract_all(s,"[0-9]+(%|\\s(percent))")
percent

## [[1]]
##  [1] "60%"        "40%"        "90%"        "70%"        "1%"        
##  [6] "7 percent"  "20%"        "28 percent" "22 percent" "23 percent"
## [11] "51 percent" "43 percent" "9 percent"  "39 percent" "45%"       
## [16] "1%"         "7%"         "1 percent"

Extract a few words before and after each percentage to see what it refers to.

str_extract_all(s,"\\w+\\s\\w+\\s[0-9]+(%|\\s(percent))\\s\\w+\\s\\w+")

## [[1]]
##  [1] "Roy owned 40% of WD"                        
##  [2] "were generating 70% of Disney"              
##  [3] "revenues by 20% every year"                 
##  [4] "merchandising became 28 percent of total"   
##  [5] "revenues contributed 22 percent of revenues"
##  [6] "drop by 23 percent and had"                 
##  [7] "to keep 51 percent ownership of"            
##  [8] "Starwave and 43 percent of Infoseek"        
##  [9] "revenue of 9 percent and net"               
## [10] "income of 39 percent with ABC"              
## [11] "a surprising 45% of Disney"                 
## [12] "shareholder at 7% and a"                    
## [13] "owned roughly 1 percent of all"

So these percentages refer to revenue, ownership, and income.

Extract dates

dates= str_extract_all(s,"(January|February|March|April|May|June|July|August|September|October|November|December)?(\\s)?([0-9][0-9])?(\\,)?(\\s)?[1-2][0-9][0-9][0-9]")
head(dates[[1]],8)

## [1] "October 16, 1923" " 1986"            ", 1991"          
## [4] " 1923"            " 1923"            "January 1926"    
## [7] "February 1928"    " 1928"

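This extracts the dates and years mentioned in the document, although imperfectly: a bare year is captured together with its leading space or comma, and a single-digit day is not recognised, so "May 6, 1991" is returned as ", 1991".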
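The pattern above makes every part optional, so it also picks up stray spaces and commas. The following is a minimal sketch of a stricter pattern, assuming only three date shapes occur ("Month D, YYYY", "Month YYYY", and a bare "YYYY"). It uses base R's regmatches() and the built-in `month.name` constant; `sample_text` is a hypothetical string built from dates seen above. Any other four-digit number starting with 1 or 2 would still be matched as a year.

```r
# Hypothetical sample string built from dates that appear in the output above.
sample_text <- "Dates seen above: October 16, 1923; 1986; May 6, 1991; February 1928."

# month.name is a built-in vector of the twelve English month names.
months <- paste(month.name, collapse = "|")

# Optional "Month " with an optional one- or two-digit day and comma,
# followed by a four-digit year; \\b keeps matches on word boundaries.
pattern <- paste0("\\b((", months, ") ([0-9]{1,2}, )?)?[12][0-9]{3}\\b")

dates <- regmatches(sample_text, gregexpr(pattern, sample_text, perl = TRUE))[[1]]
print(dates)
```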

Extract numbers associated with employees

Let's extract the numbers associated with employees.

str_extract_all(s,"[0-9]+\\,?[0-9]+\\semployees")

## [[1]]
## [1] "550 employees"   "400 employees"   "4,000 employees"

We don't know what these numbers indicate, so let's extract a few words before each number to get an idea of its context.

str_extract_all(s,"\\w+\\s\\w+\\s[0-9]+\\,?[0-9]+\\semployees")

## [[1]]
## [1] "of its 550 employees"       "laying off 400 employees"  
## [3] "laying off 4,000 employees"
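The same "words before and after" idea was used for both percents and employees, so it can be wrapped in a small helper. The following is a minimal sketch in base R; `context_around` and `sample_text` are hypothetical names, and the helper simply builds the same kind of pattern as above from a core pattern and two word counts.

```r
# Hypothetical helper: return each match of `pattern` together with
# `before` words in front of it and `after` words behind it.
context_around <- function(text, pattern, before = 2, after = 0) {
  full <- paste0(
    "(\\w+\\s){", before, "}",   # words preceding the match
    pattern,
    "(\\s\\w+){", after, "}"     # words following the match
  )
  regmatches(text, gregexpr(full, text, perl = TRUE))[[1]]
}

# Hypothetical sample string built from phrases in the output above.
sample_text <- "a part of its 550 employees, then laying off 400 employees"

hits <- context_around(sample_text, "[0-9]+ employees", before = 2)
print(hits)
```

Increasing `before` or `after` widens the context window without rewriting the pattern by hand.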

TEXT ANALYSIS OF THE WALT DISNEY COMPANY WIKIPEDIA PAGE