This is an R Markdown document. The encyclopedia was scraped to gather a few insights about Ford Motor Company. The objective is to extract all the persons and locations mentioned on the company's encyclopedia page. See https://en.wikipedia.org/wiki/Ford_Motor_Company .
So let's start by scraping the people mentioned on that page.
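library(rvest)  # provides read_html(), html_nodes() and html_text()
# Download the page and keep the text of every paragraph (<p>) node
page <- read_html('https://en.wikipedia.org/wiki/Ford_Motor_Company')
text <- html_text(html_nodes(page, 'p'))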
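Before entity extraction, scraped paragraphs usually benefit from light cleaning, since encyclopedia text carries bracketed citation markers such as "[12]". The following is a minimal base R sketch of such a step, not necessarily what was done for this report. `prepare_text` is a hypothetical helper and the sample paragraphs are made up for illustration.

```r
# Hypothetical helper (an assumption, not part of the original analysis):
# prepare scraped paragraphs for entity extraction.
prepare_text <- function(paragraphs) {
  # Remove bracketed citation markers such as "[12]".
  paragraphs <- gsub("\\[[0-9]+\\]", "", paragraphs)
  # Collapse runs of whitespace (including newlines) and trim the ends.
  paragraphs <- trimws(gsub("[[:space:]]+", " ", paragraphs))
  # Drop empty paragraphs and join the rest into one string.
  paste(paragraphs[nzchar(paragraphs)], collapse = " ")
}

# Made-up sample input standing in for the scraped `text` vector.
sample_paragraphs <- c(
  "Ford was founded by Henry Ford.[1]\n",
  "",
  "It is based in Dearborn, Michigan.[2][3]"
)
prepared <- prepare_text(sample_paragraphs)
print(prepared)
```

Joining the paragraphs into a single string is a design choice that suits annotators expecting one document. Keeping them separate would work equally well if per-paragraph results were needed.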
The list of all people extracted from the encyclopedia page is:
all_persons
## [1] "Henry Ford" "Lio Ho"
## [3] "Henry Ford Company" "John and"
## [5] "Carl Benz" "C"
## [7] "Ford" "Model"
## [9] "Lincoln" "Aston Martin"
## [11] "Bill Ford" "Mark Fields"
## [13] "Clay Ford Jr." "Jacques Nasser"
## [15] "Henry Ford II" "Gerald Forsythe"
## [17] "Kevin Kalkhoven" "Jim Padilla"
## [19] "Alan Mulally" "Gephardt"
## [21] "Stephen Butler" "Ellen Marram"
## [23] "Kimberly Casiano" "Edsel Ford II"
## [25] "William" "Anthony F. Earley"
## [27] "James P. Hackett" "John L. Thornton"
## [29] "James H. Hance" "W. Helman IV"
## [31] "Jon M. Huntsman" "John C. Lechleiter"
## [33] "Gerald L. Shaheen" "Consequently"
## [35] "Lincoln Continental" "Lincoln LS."
## [37] "Lincoln Navigator" "Falcon"
## [39] "Ward" "Henry"
## [41] "Ford-New Holland" "Richard Petty Motorsports"
## [43] "Ford Torino" "Kurt Busch"
## [45] "Marcus Grönholm" "Ken Block"
## [47] "Brian Deegan" "Jerry Titus"
## [49] "Parnelli Jones" "George Follmer"
## [51] "Bud Moore Engineering." "John Jones"
## [53] "Scott Pruett" "Dorsey Schroeder."
## [55] "Tommy Kendall" "Paul Gentilozzi"
## [57] "Miller Cup" "Joe Foster"
## [59] "Steve Maxwell" "Dave Pericak"
## [61] "Vaughn Gittin Jr," "John Force Racing"
## [63] "John Force" "Tony Pedregon"
## [65] "Robert Hight" "Bob Tasca III"
## [67] "Johnson Controls-Saft" "Ford Escape Hybrid"
## [69] "Fields" "Clay Ford"
## [71] "Mark Jones"
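A few observations on the list of people extracted:
1. The single letter “C” was tagged as a person, because a lone letter gives no basis for deciding whether it is a person, a thing, or meaningless.
2. Henry Ford appears three times: once as “Henry Ford” and once each as the separate entries “Henry” and “Ford”.
3. The names of actual people are mostly extracted correctly, but the list also contains entries that are not people, such as “Consequently”, “Lincoln Navigator”, and “Ford Escape Hybrid”.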
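Some of the noise seen in the list, such as the lone “C” and the repeated “Henry” and “Ford”, could be reduced with a small post-processing step. The following is a minimal base R sketch under stated assumptions, not the code used to produce the list above. `clean_entities` and `drop_fragments` are hypothetical helpers, and the rules they apply are illustrative choices: strip trailing commas and periods (except after Jr or Sr), drop entries with fewer than two letters, and drop single words that already occur inside a longer entry.

```r
# Hypothetical helper (an assumption): tidy raw entity strings.
clean_entities <- function(x) {
  x <- trimws(x)
  # Strip trailing commas, e.g. "Vaughn Gittin Jr," -> "Vaughn Gittin Jr".
  x <- sub(",+$", "", x)
  # Strip trailing periods unless they follow the suffix Jr or Sr.
  keep_dot <- grepl("\\b(Jr|Sr)\\.$", x, perl = TRUE)
  x[!keep_dot] <- sub("\\.+$", "", x[!keep_dot])
  # Drop entries with fewer than two letters, e.g. "C".
  x <- x[nchar(gsub("[^[:alpha:]]", "", x)) >= 2]
  unique(x)
}

# Hypothetical helper (an assumption): drop single-word entries that are
# already part of a multi-word entry, e.g. "Henry" given "Henry Ford".
drop_fragments <- function(x) {
  multi <- grepl(" ", x, fixed = TRUE)
  tokens <- unlist(strsplit(x[multi], " ", fixed = TRUE))
  x[multi | !(x %in% tokens)]
}

# A few entries copied from the list above.
sample_persons <- c("Henry Ford", "C", "Ford", "Vaughn Gittin Jr,",
                    "Dorsey Schroeder.", "Clay Ford Jr.", "Henry")
cleaned <- drop_fragments(clean_entities(sample_persons))
print(cleaned)
```

These rules are deliberately conservative. They do not remove non-person entries such as “Consequently”, which would need a stop list or manual review.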
The list of all locations extracted from the encyclopedia page is:
all_places
## [1] "Dearborn" "Michigan" "Detroit"
## [4] "Japan" "United Kingdom" "China"
## [7] "Taiwan" "Thailand" "Turkey"
## [10] "Russia" "Mercury" "United States"
## [13] "Canada" "Mexico" "Middle East"
## [16] "Europe" "Mack Avenue" "Piquette Avenue"
## [19] "Highland Park" "Soviet Union" "Ford Americas"
## [22] "North America" "Washington D.C." "Americas"
## [25] "India" "Germany" "Brazil"
## [28] "Argentina" "Australia" "South Africa"
## [31] "Britain" "Belgium" "Spain"
## [34] "Dunton" "Essex" "Cologne"
## [37] "Genk" "Valencia" "Kocaeli"
## [40] "Setúbal" "Portugal" "Romania"
## [43] "Asia" "Malaysia" "Singapore"
## [46] "Hong Kong" "Philippines" "Ulsan"
## [49] "South Korea" "Korea" "Chennai"
## [52] "Israel" "Kuwait" "Egypt"
## [55] "Saudi Arabia" "South America" "Del Rey"
## [58] "Africa" "Samcor" "Southern Africa"
## [61] "Falcon" "Indonesia" "Autorama"
## [64] "Minato" "Tokyo" "New Zealand"
## [67] "Wiri" "Broadmeadows" "Melbourne"
## [70] "Sweden" "Flat Rock" "France"
## [73] "Netherlands" "Sterling" "Springwells"
## [76] "Cork" "Ireland" "Dagenham"
## [79] "England" "Leningrad" "New Holland"
## [82] "Mercury Montegos" "Stewart" "Monte Carlo"
## [85] "Le Mans" "1,2,3" "Indianapolis"
## [88] "Norway" "Greater Los Angeles" "Mexico City"
## [91] "Louisville" "Kentucky" "1E."
## [94] "Evansville" "Indiana" "Continental Europe"
A few observations on the list of locations:
1. A few of the extracted entries are not locations at all, such as “1,2,3” and “1E.”.
2. Overlapping names are treated as separate locations. For example, “United States”, “Americas”, and “Ford Americas” each appear as their own entry.
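Entries such as “1,2,3” and duplicate names for one place can be handled after extraction. The following is a minimal base R sketch, not part of the original analysis. `tidy_places` is a hypothetical helper, and the alias table is an illustrative assumption that would need to be reviewed by hand for real use.

```r
# Hypothetical alias table (an assumption): alternative name -> canonical name.
aliases <- c("Britain" = "United Kingdom", "Korea" = "South Korea")

# Hypothetical helper (an assumption): filter and normalise location strings.
tidy_places <- function(x, aliases) {
  # Keep only entries that start with a letter; this drops "1,2,3" and "1E.".
  x <- x[grepl("^[[:alpha:]]", x)]
  # Replace known aliases with their canonical name.
  hit <- x %in% names(aliases)
  x[hit] <- unname(aliases[x[hit]])
  unique(x)
}

# A few entries copied from the list above.
sample_places <- c("United Kingdom", "1,2,3", "Britain", "1E.",
                   "Korea", "South Korea", "Dearborn")
tidied <- tidy_places(sample_places, aliases)
print(tidied)
```

An explicit lookup table keeps every merge visible and reversible, which matters because names like “Americas” and “United States” overlap without being the same place.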
Ford Locations.
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.