Inc 5,000 Fastest Growing Companies Scrape

Using rvest and xml2 to Scrape Top 5000 Fastest Growing Companies Data

The Inc. 5000 website lists the 5,000 fastest-growing private companies for 2020. I will use xml2 and rvest to scrape that data from the site.

First, I’ll import the necessary packages for the scrape.

library(tidyverse)  # attaches dplyr, stringr, tibble, and the %>% pipe
library(xml2)       # read_html()
library(rvest)      # html_nodes(), html_text()

Next, I’ll build a scraper function that pulls each attribute I need for my dataset from the page.

url <- 'https://www.inc.com/inc5000/2020'

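scraper_func <- function(x) {

  # Download and parse the page at the given URL
  web_page <- xml2::read_html(x)

  # Each column is a set of <div> elements sharing a CSS class;
  # html_text() returns their contents as a character vector
  rank <- web_page %>% rvest::html_nodes("div.rank") %>%
    html_text()
  company <- web_page %>% rvest::html_nodes("div.company") %>%
    html_text()
  growth <- web_page %>% rvest::html_nodes("div.growth") %>%
    html_text()

  # Drop the first match for industry and state so that their
  # lengths line up with the other columns
  industry_1 <- web_page %>% rvest::html_nodes("div.industry") %>%
    html_text()
  industry <- industry_1[2:length(industry_1)]

  state_1 <- web_page %>% rvest::html_nodes("div.state") %>%
    html_text()
  state <- state_1[2:length(state_1)]

  city <- web_page %>% rvest::html_nodes("div.city") %>%
    html_text()

  # Bind the vectors column-wise (a character matrix), then convert to a tibble
  df <- cbind(rank, company, growth, industry, state, city)

  final_df <- as_tibble(df)

  return(final_df)

}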

inc_2020_scrape <- scraper_func(url)
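
To see what each selector in the function returns, the same calls can be run on a small inline HTML fragment. This is only a minimal sketch: the markup below is made up for illustration (it just mimics the div.rank and div.company class names used above) and is not the actual structure of the Inc. page. The toy_* names are hypothetical.

```r
library(xml2)
library(rvest)

# Hypothetical markup: a header row followed by two data rows.
toy_html <- '
<div class="row"><div class="rank">Rank</div><div class="company">Company</div></div>
<div class="row"><div class="rank">1</div><div class="company">Acme</div></div>
<div class="row"><div class="rank">2</div><div class="company">Globex</div></div>
'

# read_html() also accepts a literal HTML string, not just a URL or file path.
toy_page <- xml2::read_html(toy_html)

# html_nodes() selects every element matching the CSS selector;
# html_text() returns the text of each selected element as a character vector.
toy_rank <- toy_page %>% rvest::html_nodes("div.rank") %>% rvest::html_text()
toy_company <- toy_page %>% rvest::html_nodes("div.company") %>% rvest::html_text()

toy_rank
toy_company
```

In this toy example the header text comes back as the first element of each vector, which is the kind of leftover row a later cleaning step has to drop.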

While working with the scraped data, I found several cases where the list repeats a rank instead of moving on to the next number. As you can see in the image below, the rank is duplicated, but the company and all other information are different.

Because of these repeated ranks, the dataset actually contains 5,004 companies.
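
One way to find repeated ranks is to count how many rows share each rank value. The sketch below uses a small made-up tibble (toy_scrape is a hypothetical stand-in, not the real scrape) with the rank stored as character, as it would be straight after scraping.

```r
library(dplyr)

# Hypothetical stand-in for the scraped tibble; rank "2" appears twice.
toy_scrape <- tibble::tibble(
  rank = c("1", "2", "2", "3"),
  company = c("A", "B", "C", "D")
)

# Count rows per rank and keep only the ranks shared by more than one company.
dup_ranks <- toy_scrape %>%
  count(rank, name = "n_companies") %>%
  filter(n_companies > 1)

dup_ranks
```

Running the same count() and filter() steps on the real scrape should list each repeated rank along with how many companies share it.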

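# Drop the first scraped row (the table header, not a company)
inc_2020 <- inc_2020_scrape[-1,]
inc_2020 <- inc_2020 %>%
  # Renumber ranks 1..n so that repeated ranks no longer collide
  mutate(rank = dplyr::row_number()) %>%
  # Strip the percent sign and the thousands separator
  mutate(growth = str_replace(growth, '%','')) %>%
  mutate(growth = str_replace(growth, ',','')) %>%
  # Convert from a percentage to a multiple (e.g. 150% becomes 1.5)
  mutate(growth = as.integer(growth)/100)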

names(inc_2020) <- c('Rank', 'Company', 'Growth', 'Industry', 'State', 'City')
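
As an alternative to the two str_replace() calls, readr::parse_number() can strip the percent sign and the thousands separator in one step. This is a sketch on made-up strings shaped like the scraped growth values. The values and the growth_* names are illustrative only.

```r
library(readr)

# Illustrative growth strings in the scraped format (not real Inc. figures).
growth_raw <- c("1,234%", "567%", "89.5%")

# parse_number() ignores non-numeric characters around the number and
# treats "," as the grouping mark under the default locale.
growth_pct <- readr::parse_number(growth_raw)

# Divide by 100 to express growth as a multiple rather than a percentage.
growth_multiple <- growth_pct / 100
growth_multiple
```

Unlike as.integer(), this keeps any decimal part of the percentage, which may or may not matter depending on how the source formats its figures.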

Now that we have our dataset, let’s preview it:

inc <- tibble::as_tibble(inc_2020)
inc
## # A tibble: 5,004 x 6
##     Rank Company                Growth Industry               State    City     
##    <int> <chr>                   <dbl> <chr>                  <chr>    <chr>    
##  1     1 OneTrust                 483. Software               Georgia  Atlanta  
##  2     2 Create Music Group       468  Media                  Califor~ Los Ange~
##  3     3 Lovell Government Ser~   409. Health                 Florida  Pensacola
##  4     4 Avalon Healthcare Sol~   260. Health                 Florida  Tampa    
##  5     5 ZULIE VENTURE INC        254. Telecommunications     Texas    Stafford 
##  6     6 Hunt A Killer            205. Consumer Products & S~ Maryland Baltimore
##  7     7 Case Energy Partners     179. Energy                 Texas    Dallas   
##  8     8 Nationwide Mortgage B~   164. Financial Services     New York Melville 
##  9     9 Paxon Energy             151. Energy                 Califor~ Pleasant~
## 10    10 Inspire11                139. Business Products & S~ Illinois Chicago  
## # ... with 4,994 more rows

As you can see, with rvest, xml2, and just a few lines of code, we can scrape 5,000+ rows of data in seconds.