Inc. 5000 Fastest-Growing Companies Scrape
Using rvest and xml2 to Scrape the Top 5,000 Fastest-Growing Companies Data
The Inc. 5000 website lists the 5,000 fastest-growing private companies of 2020. I'll use xml2 and rvest to scrape the data from the site.
First, I’ll import the necessary packages for the scrape.
library(tidyverse)
library(tibble)
library(xml2)
library(rvest)
Next, I'll build a scraper function that pulls each of the attributes I need for my dataset from the page.
url <- 'https://www.inc.com/inc5000/2020'
scraper_func <- function(x) {
  web_page <- xml2::read_html(x)
  rank <- web_page %>%
    rvest::html_nodes("div.rank") %>%
    html_text()
  company <- web_page %>%
    rvest::html_nodes("div.company") %>%
    html_text()
  growth <- web_page %>%
    rvest::html_nodes("div.growth") %>%
    html_text()
  # The first "industry" and "state" matches are not company records
  # (they appear to be column headers), so drop them to keep every
  # vector the same length.
  industry_1 <- web_page %>%
    rvest::html_nodes("div.industry") %>%
    html_text()
  industry <- industry_1[2:length(industry_1)]
  state_1 <- web_page %>%
    rvest::html_nodes("div.state") %>%
    html_text()
  state <- state_1[2:length(state_1)]
  city <- web_page %>%
    rvest::html_nodes("div.city") %>%
    html_text()
  # Combine the scraped columns and return them as a tibble
  df <- cbind(rank, company, growth, industry, state, city)
  final_df <- as_tibble(df)
  return(final_df)
}
inc_2020_scrape <- scraper_func(url)
While working with this scrape, I found several instances where a rank number was duplicated instead of advancing to the next one: the rank repeats, but the company and all other information are different.
Because of this problem, we actually have 5,004 companies in this dataset rather than exactly 5,000.
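To see where this happens, one quick check (not in the original post) is to count the scraped rank values and keep any that appear more than once:
inc_2020_scrape %>%
  count(rank, sort = TRUE) %>%
  filter(n > 1)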
inc_2020 <- inc_2020_scrape[-1,]
inc_2020 <- inc_2020 %>%
  mutate(rank = dplyr::row_number()) %>%             # rebuild a unique, sequential rank
  mutate(growth = str_replace(growth, '%', '')) %>%  # drop the percent sign
  mutate(growth = str_replace(growth, ',', '')) %>%  # drop the thousands separator
  mutate(growth = as.integer(growth) / 100)          # convert the text to a number
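# An aside (not part of the original approach): readr::parse_number() would strip
# the percent sign and the comma in a single step, e.g.
#   mutate(growth = readr::parse_number(growth) / 100)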
names(inc_2020) <- c('Rank', 'Company', 'Growth', 'Industry', 'State', 'City')
Now that we have our dataset, let's preview it:
inc <- tibble::as_tibble(inc_2020)
inc
## # A tibble: 5,004 x 6
## Rank Company Growth Industry State City
## <int> <chr> <dbl> <chr> <chr> <chr>
## 1 1 OneTrust 483. Software Georgia Atlanta
## 2 2 Create Music Group 468 Media Califor~ Los Ange~
## 3 3 Lovell Government Ser~ 409. Health Florida Pensacola
## 4 4 Avalon Healthcare Sol~ 260. Health Florida Tampa
## 5 5 ZULIE VENTURE INC 254. Telecommunications Texas Stafford
## 6 6 Hunt A Killer 205. Consumer Products & S~ Maryland Baltimore
## 7 7 Case Energy Partners 179. Energy Texas Dallas
## 8 8 Nationwide Mortgage B~ 164. Financial Services New York Melville
## 9 9 Paxon Energy 151. Energy Califor~ Pleasant~
## 10 10 Inspire11 139. Business Products & S~ Illinois Chicago
## # ... with 4,994 more rows
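As a quick sanity check on the cleaned data, we can run a simple summary. For example, counting companies by state (an illustrative query using the renamed columns above, not part of the original scrape):
inc %>%
  count(State, sort = TRUE) %>%
  head(10)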
As you can see, with rvest and xml2 and just a few lines of code, we can easily and efficiently scrape 5,000+ rows of data in seconds.