Data from http://www.opendata500.com/us/list/

The first step in analyzing our new leads is to do some exploratory data analysis to identify the high priority leads. First, let’s load our sample data and see what it looks like:

library(dplyr)
library(ggplot2)
leads = read.csv("us_companies.csv")
glimpse(leads)
## Observations: 529
## Variables: 22
## $ company_name_id     <fct> 3-round-stones-inc, 48-factoring-inc, 5psolu…
## $ company_name        <fct> "3 Round Stones, Inc.", "48 Factoring Inc.",…
## $ url                 <fct> http://3RoundStones.com, https://www.48facto…
## $ year_founded        <int> 2010, 2014, 2007, 1965, 1999, 1989, 1962, 19…
## $ city                <fct> Washington,  Philadelphia, Fairfax, Cambridg…
## $ state               <fct> DC, PA, VA, MA, CA, IL, PA, AR, CA, CA, NY, …
## $ country             <fct> us, us, us, us, us, us, us, us, us, us, us, …
## $ zip_code            <int> 20004, 19087, 22003, 2138, 94583, 60601, 168…
## $ full_time_employees <fct> "1-10", "51-200", "1-10", "1,001-5,000", "50…
## $ company_type        <fct> Private, Private, Private, Private, Private,…
## $ company_category    <fct> Data/Technology, Finance & Investment, Data/…
## $ revenue_source      <fct> "Data analysis for clients, Database licensi…
## $ business_model      <fct> "Business to Business, Business to Consumer"…
## $ social_impact       <fct> "", "Small Business Owners", "", "", "Citize…
## $ description         <fct> "3 Round Stones produces a platform for publ…
## $ description_short   <fct> "Our Open Source platform is used by the For…
## $ source_count        <fct> NA, 11-50, NA, 101+, 101+, NA, NA, 101+, NA,…
## $ data_types          <fct> "", "Business", "", "", "Health/Healthcare, …
## $ example_uses        <fct> , , , , , , , , , , , , , , , , , , , , , , …
## $ data_impacts        <fct> "[]", "[u'Cost efficiency', u'Job growth', u…
## $ financial_info      <fct> "3 Round Stones is a profitable, self-funded…
## $ last_updated        <fct> 2014-11-12 14:44:25.969871, 2015-05-18 11:36…
ggplot(data = leads, aes(x = year_founded)) + geom_bar(stat = "count")

leads$full_time_employees = factor(leads$full_time_employees, 
    levels = c("1-10", "11-50", "51-200", "201-500", "501-1,000", 
        "1,001-5,000", "5,001-10,000", "10,001+"))
ggplot(data = leads, aes(x = full_time_employees)) + geom_bar(stat = "count")

leads_ct = leads %>% count(company_type) %>% arrange(-n) %>% 
    mutate(perc = round((n/nrow(leads)) * 100, 1))
leads_cc = leads %>% count(company_category) %>% arrange(-n) %>% 
    mutate(perc = round((n/nrow(leads)) * 100, 1))
leads_bm = leads %>% count(business_model) %>% arrange(-n) %>% 
    mutate(perc = round((n/nrow(leads)) * 100, 1))
leads_st = leads %>% count(state) %>% arrange(-n) %>% mutate(perc = round((n/nrow(leads)) * 
    100, 1))

 

We see that the majority of the companies here were founded fairly recently (since the year 2000), and are relatively small. The most common company size is just 1-10 employees, with 11-50 and 51-200 the next largest in size. However, there are a fair amount of 10,001+ employee companies.

 

# Company Type
datatable(leads_ct, options = list(dom = "t"))
# Company Category
datatable(leads_cc, options = list(dom = "t"))
# Company Business Model
datatable(leads_bm, options = list(dom = "t"))
# Company Location
datatable(leads_st, options = list(dom = "t"))

 

Continuing our exploratory data analysis, we see that the vast majority (75%) of our companies are privately owned. The most popular industries are Data/Technology and Finance & Investement, and the most common business models are Business to Business and Business to Consumer, though there are quite a few companies with no data on their business model. Most of our leads are headquartered in California and New York, with only 11 in Colorado.

Because these leads are brand new and we have no data on their quality, lets make the assumption that the leads we want to focus on are those in Colorado and surrounding states and have more than 50 employees (excluding the 1-10 and 11-50 employee buckets). We will scrub out some of the leads with incomplete data as well.

 

nearby_states = c("CA", "OR", "WA", "ID", "NV", "MT", "ND", "SD", 
    "CO", "WY", "OK", "AZ", "UT", "NM", "TX", "NE", "KS")
leads_best = leads %>% filter(!full_time_employees %in% c("1-10", 
    "11-50") & state %in% nearby_states) %>% na.omit() %>% select(company_name, 
    url, year_founded, city, state, full_time_employees, company_type, 
    company_category, business_model)

datatable(leads_best, options = list(dom = "t"))

 

We are thus left with 36 high priority leads that could be given to the most experienced salespeople. The remaining leads can be distributed based on industries that we would specificially like to target. As sales reps contact the leads and collect data in CRM, we will be able to compute RFM scores and construct segments for hot and cold leads. Below, I have used existing data and imputed RFM scores using random numbers.

RFM (recency, frequency, monetary) analysis is a behavior based technique used to segment customers by examining their transaction history: how recently a customer has purchased (recency), how often they purchase (frequency), how much the customer spends (monetary). Here we will use a range of values from 1 to 3, with 1 being worst and 3 being best.

 

leads_rfm = leads %>% select(company_name, url, year_founded, 
    city, state, full_time_employees, company_type, company_category, 
    business_model)

set.seed(99)
leads_rfm$recency_score = sample(1:3, size = nrow(leads_rfm), 
    replace = TRUE)
leads_rfm$frequency_score = sample(1:3, size = nrow(leads_rfm), 
    replace = TRUE)
leads_rfm$monetary_score = sample(1:3, size = nrow(leads_rfm), 
    replace = TRUE)
leads_rfm$rfm_score = paste0(leads_rfm$recency_score, leads_rfm$frequency_score, 
    leads_rfm$monetary_score)

table(leads_rfm$rfm_score)
## 
## 111 112 113 121 122 123 131 132 133 211 212 213 221 222 223 231 232 233 
##  26  17  20  12  23  17  21  13  14  18  30  21  20  17  22  18  17  24 
## 311 312 313 321 322 323 331 332 333 
##  19  22  12  30  19  18  18  22  19

 

The resulting RFM represents different segments of leads. For instance, we have 26 “111” leads, which represents the lead that purchased a long time ago, does not purchase often, and spent the least amount of money. On the opposite side we have 19 “333” leads, which represent the best and most loyal customers. A “333” might not need much additional convincing to renew, and a “111” might be totally uninterested in renewing, but a borderline customer in the “222” RFM range would require the most attention to keep them going cold. Thus, sales reps would be instructed to focus on those borderline leads at renewal time.

 

As renewal data is gathered, we can use it to inform a logistic regression model to project the probability of renewal/nonrenewal for any future business leads we purchase. Future leads would then be quantified by a RFM and renewal probability score that could be communicated to sales reps via CRM, so they would be able to prioritize their work load.