The Goal

To look at “revenue per employee” for a range of companies (platform and not) - and across time (i.e., across the age of the company). Which ones? from this list: https://www.zyxware.com/articles/4344/list-of-fortune-500-companies-and-their-websites. How? By getting data on two metrics (‘revenue’ and ‘headcount’) for each company, for each year from 2007 to 2019. Where? data from macrotrends.net.

Setup

First, let’s set up some packages and libraries, and some initial values.

# install.packages("rvest")
# install.packages("hablar")
library(rvest)
library(dplyr)
library(hablar)
library(ggplot2)
library(tidyverse) # for drop_na(y), tidyr
library(janitor)

start.year <- 2007
end.year <- 2019
n.years <- (end.year - start.year) + 1

revenue.file <- "CompanyPerformance - revenue.csv"
headcount.file <- "CompanyPerformance - headcount.csv"

output <- data.frame(year=numeric(), value=numeric(), company=character(), metric = character())

Creating Company List and URL table

A Google search of the form ‘company revenue macrotrends’ returns a URL, and these are collected in the Google Sheet https://docs.google.com/spreadsheets/d/1fEl_9LY552BX9-C0G0FtwczkSUi87vsC6JCxkzx_UM8/edit#gid=1461480255, from which I’ve exported two local .csv files. Read these in, and drop companies for which we do not have results.

URL.table.revenue <- read.csv(file=revenue.file)
URL.table.headcount <- read.csv(file=headcount.file)

URL.table <- bind_rows(URL.table.revenue, URL.table.headcount) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% mutate_all(~ifelse(. %in% c("N/A", "null", ""), NA, .)) %>% na.omit()

Fetching the Data

Here is the main function for fetching the data. It is given a URL, and also the company name (eg Facebook) and the metric (eg revenue) that is available on that page. The data (for each company, metric pair) are extracted from the web page in the form of a list that alternates between year and value, so it is converted into a structured tabular form (year, value, company, metric).

fetch.data <- function(companyname, metricname, url) {
    webpage <- read_html(url) # read the page 
    html <- rvest::html_nodes(webpage, "thead+ thead th , #style-1 td")
    results <- rvest::html_text(html) # capture the data into results
    
    results <- results[1:(2*n.years)] # just want the n.years (year, value) pairs
    data.table <- data.frame(year = results[-1+2*(1:n.years)], value = results[2*(1:n.years)]) %>% convert(num(year)) %>% filter(year <= end.year & year >= start.year) %>% na.omit() # convert to data.frame 
    
    data.table$value <- gsub('[^a-zA-Z0-9.]','',data.table$value) # remove extraneous symbols in value
    data.table <- data.table %>% convert(num(value), num(year)) 
    
    data.table <- data.table %>% mutate(company = companyname, metric = metricname) # add the two columns with repeating values
}


# url = "https://www.macrotrends.net/stocks/charts/FB/facebook/revenue" # the webpage for the data  # 
# metricname = "revenue"
# companyname = "facebook"
# data.table %>% mutate_if(is.character,as.numeric) # 

Collecting the Revenue and Headcount Data

Call the fetch.data function to collect the data for a set of rows in the URL.table

# Get the data for n rows (companies) in URL.table
nrow <- c(1:50, 57, 101:110) # 
companies <- c("Walmart", "Exxon Mobil", "Apple", "Berkshire Hathaway", "Amazon.com", "AT&T", "Chevron", "Ford Motor", "General Motors", "Alphabet", "JPMorgan Chase", "Verizon Communications", "Kroger", "General Electric",  "Microsoft", "Home Depot", "Boeing", "Wells Fargo", "Citigroup",  "Comcast", "IBM", "Target", "Walt Disney", "Facebook", "Caterpillar", "Delta Air Lines", "Uber", "Marriott", "Expedia", "Twitter", "eBay", "JD.com", "Alibaba", "Tencent", "LendingTree")

# first construct a blank data frame, then append data into it
output <- data.frame(year=numeric(), value=numeric(), company=character(), metric = character())

for (n in companies) { # or n in nrow
  company = n # or company = URL.table[n,"companyname"]
  url.rev = as.character(URL.table %>% filter(companyname == company & metric == "revenue") %>% select("url"))  # as.character
  url.head = as.character(URL.table %>% filter(companyname == company & metric == "headcount") %>% select("url"))   # 
n.revenue <- fetch.data(company, "revenue", url.rev)
n.headcount <- fetch.data(company, "headcount", url.head) 
  
output <- bind_rows(output, n.revenue, n.headcount) 
}

# output <- output %>% group_by(company, year) %>% mutate(revperemp = value(revenue) / value(headcount))

write.csv(output, file="outputdata.csv")

# y <- fetch.data("General Motors", "revenue",  "https://www.macrotrends.net/stocks/charts/GM/general-motors/revenue")

Visualization of Results

Note that the revenues are in millions ($) while employee count is regular numbers. So if the dots or lines meet, that represents average revenue per employee is $1 million. When the revenue curve (blue) is below the headcount curve (red) then average revenue per employee is less than $1 million, and so on.

df.eg = output %>% filter(company %in% companies) %>% group_by(company, metric)
# or company %in% URL.table[c(1:19,57, 101:110), "companyname"]
ggplot(df.eg, aes(x=year, y=value)) + geom_point(aes(shape=metric), size=0.75) + geom_line(aes(color=metric)) + facet_wrap(~company, scales = "free") + theme_classic() + theme(strip.text = element_text(size=15))

Archived

Example Calls for Data

An example call,

fb.revenue <- fetch.data("Facebook", "revenue", "https://www.macrotrends.net/stocks/charts/FB/facebook/revenue")

fb.headcount <- fetch.data("Facebook", "headcount", "https://www.macrotrends.net/stocks/charts/FB/facebook/number-of-employees")

Another example,

Of course we do not want to create a separate data object for each call. The next step is to just create one empty data object and then keep adding rows to it.

visualization