To look at “revenue per employee” for a range of companies (platform and not) - and across time (i.e., across the age of the company). Which ones? from this list: https://www.zyxware.com/articles/4344/list-of-fortune-500-companies-and-their-websites. How? By getting data on two metrics (‘revenue’ and ‘headcount’) for each company, for each year from 2007 to 2019. Where? data from macrotrends.net.
First, let’s set up some packages and libraries, and some initial values.
# install.packages("rvest")
# install.packages("hablar")
library(rvest)
library(dplyr)
library(hablar)
library(ggplot2)
library(tidyverse) # for drop_na(y), tidyr
library(janitor)
start.year <- 2007
end.year <- 2019
n.years <- (end.year - start.year) + 1
revenue.file <- "CompanyPerformance - revenue.csv"
headcount.file <- "CompanyPerformance - headcount.csv"
output <- data.frame(year=numeric(), value=numeric(), company=character(), metric = character())
A Google search of the form ‘company revenue macrotrends’ returns a URL, and these are collected in the Google Sheet https://docs.google.com/spreadsheets/d/1fEl_9LY552BX9-C0G0FtwczkSUi87vsC6JCxkzx_UM8/edit#gid=1461480255, from which I’ve exported two local .csv files. Read these in, and drop companies for which we do not have results.
URL.table.revenue <- read.csv(file=revenue.file)
URL.table.headcount <- read.csv(file=headcount.file)
URL.table <- bind_rows(URL.table.revenue, URL.table.headcount) %>% select(where(~ !(all(is.na(.)) | all(. == "")))) %>% mutate_all(~ifelse(. %in% c("N/A", "null", ""), NA, .)) %>% na.omit()
Here is the main function for fetching the data. It is given a URL, and also the company name (eg Facebook) and the metric (eg revenue) that is available on that page. The data (for each company, metric pair) are extracted from the web page in the form of a list that alternates between year and value, so it is converted into a structured tabular form (year, value, company, metric).
fetch.data <- function(companyname, metricname, url) {
webpage <- read_html(url) # read the page
html <- rvest::html_nodes(webpage, "thead+ thead th , #style-1 td")
results <- rvest::html_text(html) # capture the data into results
results <- results[1:(2*n.years)] # just want the n.years (year, value) pairs
data.table <- data.frame(year = results[-1+2*(1:n.years)], value = results[2*(1:n.years)]) %>% convert(num(year)) %>% filter(year <= end.year & year >= start.year) %>% na.omit() # convert to data.frame
data.table$value <- gsub('[^a-zA-Z0-9.]','',data.table$value) # remove extraneous symbols in value
data.table <- data.table %>% convert(num(value), num(year))
data.table <- data.table %>% mutate(company = companyname, metric = metricname) # add the two columns with repeating values
}
# url = "https://www.macrotrends.net/stocks/charts/FB/facebook/revenue" # the webpage for the data #
# metricname = "revenue"
# companyname = "facebook"
# data.table %>% mutate_if(is.character,as.numeric) #
Call the fetch.data function to collect the data for a set of rows in the URL.table
# Get the data for n rows (companies) in URL.table
nrow <- c(1:50, 57, 101:110) #
companies <- c("Walmart", "Exxon Mobil", "Apple", "Berkshire Hathaway", "Amazon.com", "AT&T", "Chevron", "Ford Motor", "General Motors", "Alphabet", "JPMorgan Chase", "Verizon Communications", "Kroger", "General Electric", "Microsoft", "Home Depot", "Boeing", "Wells Fargo", "Citigroup", "Comcast", "IBM", "Target", "Walt Disney", "Facebook", "Caterpillar", "Delta Air Lines", "Uber", "Marriott", "Expedia", "Twitter", "eBay", "JD.com", "Alibaba", "Tencent", "LendingTree")
# first construct a blank data frame, then append data into it
output <- data.frame(year=numeric(), value=numeric(), company=character(), metric = character())
for (n in companies) { # or n in nrow
company = n # or company = URL.table[n,"companyname"]
url.rev = as.character(URL.table %>% filter(companyname == company & metric == "revenue") %>% select("url")) # as.character
url.head = as.character(URL.table %>% filter(companyname == company & metric == "headcount") %>% select("url")) #
n.revenue <- fetch.data(company, "revenue", url.rev)
n.headcount <- fetch.data(company, "headcount", url.head)
output <- bind_rows(output, n.revenue, n.headcount)
}
# output <- output %>% group_by(company, year) %>% mutate(revperemp = value(revenue) / value(headcount))
write.csv(output, file="outputdata.csv")
# y <- fetch.data("General Motors", "revenue", "https://www.macrotrends.net/stocks/charts/GM/general-motors/revenue")
Note that the revenues are in millions ($) while employee count is regular numbers. So if the dots or lines meet, that represents average revenue per employee is $1 million. When the revenue curve (blue) is below the headcount curve (red) then average revenue per employee is less than $1 million, and so on.
df.eg = output %>% filter(company %in% companies) %>% group_by(company, metric)
# or company %in% URL.table[c(1:19,57, 101:110), "companyname"]
ggplot(df.eg, aes(x=year, y=value)) + geom_point(aes(shape=metric), size=0.75) + geom_line(aes(color=metric)) + facet_wrap(~company, scales = "free") + theme_classic() + theme(strip.text = element_text(size=15))
An example call,
fb.revenue <- fetch.data("Facebook", "revenue", "https://www.macrotrends.net/stocks/charts/FB/facebook/revenue")
fb.headcount <- fetch.data("Facebook", "headcount", "https://www.macrotrends.net/stocks/charts/FB/facebook/number-of-employees")
Another example,
Of course we do not want to create a separate data object for each call. The next step is to just create one empty data object and then keep adding rows to it.