Overview

To measure the growth of a company, we can use various business and market-related metrics. One interesting question is whether we can analyse and estimate a company’s growth from the number of its LinkedIn profiles and followers. I will use a data set provided by Thinknum (available at the Data Incubator blog), which tracks the number of employees on the LinkedIn platform and provides real-time insight into the growth of companies. Another interesting question is whether social media popularity is a significant indicator of a company’s growth performance. For this purpose, I am also going to use the Facebook follower data of the same companies and check whether a company’s employee growth is strongly correlated with its social media popularity. In short, I am interested in answering the following questions:
1. If there is any correlation, is it statistically significant?
2. How can social media popularity be used as a metric for the growth of a company?
3. Can social media activity be used to forecast a company’s growth? If so, which types of companies have growth that is most affected by variation in social media popularity?

Data Source

I will use two data sets provided by Thinknum (can be found at the Data Incubator blog):

LinkedIn profiles: contains each company’s number of employees and followers on a daily basis. This data set is 441MB and contains approximately 2.4 million rows.
Facebook followers: contains each company’s number of likes, talking-about counts, check-ins, etc. This data set is almost 1GB and contains approximately 3.6 million rows.

Some Exploratory Analysis

First we load the two data sets and useful packages. It takes a little time, as the data set is big.

library(dplyr)
library(ggplot2)
library(gridExtra)

linkedin = read.csv("temp_datalab_records_linkedin_company.csv")
facebook = read.csv("temp_datalab_records_social_facebook.csv")

Next we do a simple summary of the interesting columns of the two data sets.

summary(linkedin[,c(3,4,5,7)])

##              company_name     followers_count   employees_on_platform
##  City National Bank:   1605   Min.   :      0   Min.   :     0       
##  American Airlines :   1029   1st Qu.:   2148   1st Qu.:   218       
##  Apple             :   1025   Median :   9335   Median :  1083       
##  Activision        :   1024   Mean   :  71677   Mean   :  7587       
##  Amgen             :   1024   3rd Qu.:  38642   3rd Qu.:  4513       
##  Cisco             :   1024   Max.   :7833967   Max.   :577952       
##  (Other)           :2419465                                          
##                industry      
##  Banking           : 168364  
##  Biotechnology     : 152710  
##  Financial Services: 148143  
##  Oil & Energy      : 116830  
##  Retail            :  95384  
##  Pharmaceuticals   :  92107  
##  (Other)           :1652658

summary(facebook[,c(3,4,7,8,9)])

##        username          checkins            likes          
##            : 120929   Min.   :       0   Min.   :        1  
##  2u        :   1222   1st Qu.:       0   1st Qu.:     2500  
##  aflacduck :   1222   Median :      13   Median :    20477  
##  ModelNInc :   1222   Mean   :   14170   Mean   :   816625  
##  RedRobin  :   1222   3rd Qu.:     286   3rd Qu.:   217579  
##  shutterfly:   1222   Max.   :17290550   Max.   :210641077  
##  (Other)   :3494352                                         
##  talking_about_count  facebook_id       
##  Min.   :      0     Min.   :5.182e+09  
##  1st Qu.:     27     1st Qu.:9.448e+10  
##  Median :    251     Median :1.123e+14  
##  Mean   :  10043     Mean   :1.738e+14  
##  3rd Qu.:   2474     3rd Qu.:1.941e+14  
##  Max.   :5747010     Max.   :1.015e+16  
##

To start with some exploratory plots, I pick some well-known, established companies that appear in both data sets and plot their time series. For example, I first pick “McDonald’s” and plot its number of Facebook likes and its number of employees on LinkedIn.

mcdonalds.lkd = linkedin[linkedin$company_name=="McDonald's",c(2,5)]
mcdonalds.fb = facebook[facebook$username=="McDonalds",c(2,7)]

mcdonalds.lkd$as_of_date = as.POSIXct(mcdonalds.lkd$as_of_date)
mcdonalds.fb$time = as.POSIXct(mcdonalds.fb$time)

# plotting
m.lkd = ggplot(mcdonalds.lkd, aes(x = as_of_date, y = employees_on_platform)) +
  geom_line(size = 1, color = "blue") + xlab("Time")+ylab("McDonald's Employees")
m.fb = ggplot(mcdonalds.fb, aes(x = time, y = likes)) +
  geom_line(size = 1, color = "blue") + xlab("Time")+ylab("McDonald's Likes")
grid.arrange(m.lkd, m.fb, nrow = 2)

For McDonald’s, employee growth and social media growth look highly correlated.

For the next plot we pick another well-known company, “Starbucks”, which has a high talking-about count on social media, and draw a similar plot.

starbucks.lkd = linkedin[linkedin$company_name=="Starbucks",c(2,5)]
starbucks.fb = facebook[facebook$username=="Starbucks",c(2,7)]

starbucks.lkd$as_of_date = as.POSIXct(starbucks.lkd$as_of_date)
starbucks.fb$time = as.POSIXct(starbucks.fb$time)

# plotting
s.lkd = ggplot(starbucks.lkd, aes(x = as_of_date, y = employees_on_platform)) +
  geom_line(size = 1, color = "blue") + xlab("Time")+ylab("Starbucks Employees")
s.fb = ggplot(starbucks.fb, aes(x = time, y = likes)) +
  geom_line(size = 1, color = "blue") + xlab("Time")+ylab("Starbucks Likes")
grid.arrange(s.lkd, s.fb, nrow = 2)

From the plot, we can infer that at the beginning of 2018, Starbucks’ social media popularity did not keep pace with its number of employees. Some unknown variables may have hurt its social media presence.

An interesting follow-up question is how the number of employees and social media popularity are correlated for small and medium-sized companies.

Both McDonald’s and Starbucks are food-related companies. We will draw the same kind of plot for some other big companies. For example, let’s take tech companies such as Microsoft and Google and inspect similar plots of their Facebook likes and employee numbers.

# For Microsoft's likes/employee count
microsoft.lkd = linkedin[linkedin$company_name=="Microsoft",c(2,5)]
microsoft.fb = facebook[facebook$username=="Microsoft",c(2,7)]

microsoft.lkd$as_of_date = as.POSIXct(microsoft.lkd$as_of_date)
microsoft.fb$time = as.POSIXct(microsoft.fb$time)

# plotting
m.lkd = ggplot(microsoft.lkd, aes(x = as_of_date, y = employees_on_platform)) +
  geom_line(size = 1, color = "blue") + xlab("Time")+ylab("Microsoft's Employees")
m.fb = ggplot(microsoft.fb, aes(x = time, y = likes)) +
  geom_line(size = 1, color = "blue") + xlab("Time")+ylab("Microsoft's Likes")
grid.arrange(m.lkd, m.fb, nrow = 2)

# For Google's likes/employee count
google.lkd = linkedin[linkedin$company_name=="Google",c(2,5)]
google.fb = facebook[facebook$username=="Google",c(2,7)]

google.lkd$as_of_date = as.POSIXct(google.lkd$as_of_date)
google.fb$time = as.POSIXct(google.fb$time)

# plotting
m.lkd = ggplot(google.lkd, aes(x = as_of_date, y = employees_on_platform)) +
  geom_line(size = 1, color = "blue") + xlab("Time")+ylab("Google's Employees")
m.fb = ggplot(google.fb, aes(x = time, y = likes)) +
  geom_line(size = 1, color = "blue") + xlab("Time")+ylab("Google's Likes")
grid.arrange(m.lkd, m.fb, nrow = 2)

For both of these companies we can see that there is a similar kind of uptrend in both the number of likes and the number of employees.
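The four per-company blocks above repeat the same subsetting and date-parsing steps. As a rough sketch, they could be wrapped in a small helper. The function name `get_series` and the toy table below are hypothetical and only illustrate the idea; the column names are the ones used in the code above.

```r
# Hypothetical helper: pull one company's series out of a table and parse its dates.
get_series <- function(data, name_col, name, time_col, value_col) {
  rows <- data[[name_col]] == name
  out <- data.frame(
    time  = as.Date(data[[time_col]][rows]),
    value = data[[value_col]][rows]
  )
  out[order(out$time), ]  # make sure the series is in time order
}

# Toy stand-in for the LinkedIn table (made-up numbers, not the real data).
toy <- data.frame(
  as_of_date = c("2016-04-02", "2016-04-01", "2016-04-01"),
  company_name = c("Microsoft", "Microsoft", "Google"),
  employees_on_platform = c(101, 100, 200),
  stringsAsFactors = FALSE
)

ms <- get_series(toy, "company_name", "Microsoft",
                 "as_of_date", "employees_on_platform")
print(ms)
```

The returned data frame has fixed column names (`time`, `value`), so the same plotting code could be reused for every company and for both data sets.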

Preprocessing and Some Analysis

First, note that these are time series data. Since I am working with two data sets, for each company I have two time series: one representing the number of likes and the other the number of employees. But the timestamps of the two series differ, as we can see from the exploratory analysis. So the first thing I will do is merge the two time series from a suitable starting point and then use the likes count and the employee count to find out whether there is any correlation.

Also, the two data sets may not contain the same set of companies. So for my preliminary analysis I have handpicked a few well-known companies and run some simple analysis on each of them.
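Company identifiers also differ between the two data sets (for example, `company_name` is "McDonald's" on LinkedIn while `username` is "McDonalds" on Facebook). One possible way to find candidates that appear in both is to compare normalized names. This is only a sketch: `normalize_name` is a hypothetical helper, the name vectors are a small made-up sample, and real matching would still need manual checking.

```r
# Hypothetical helper: lower-case a name and drop everything except letters and digits.
normalize_name <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# Small illustrative samples, not the full columns.
linkedin_names <- c("McDonald's", "Starbucks", "City National Bank")
facebook_names <- c("McDonalds", "Starbucks", "aflacduck")

common <- intersect(normalize_name(linkedin_names),
                    normalize_name(facebook_names))
print(common)
```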

mcdonalds.fb$time = as.Date(mcdonalds.fb$time)

colnames(mcdonalds.lkd)[1]='time'
mcdonalds.lkd$time = as.Date(mcdonalds.lkd$time)


start = as.Date("2016-04-01")
mcd.fb = mcdonalds.fb[mcdonalds.fb$time >=start, ]
mcd.lkd = mcdonalds.lkd[mcdonalds.lkd$time >=start, ] 

# merge two time series:
mcd = merge(mcd.fb,mcd.lkd, by = 'time', all.x = TRUE, all.y = TRUE)

head(mcd)

##         time    likes employees_on_platform
## 1 2016-04-01 63337311                104180
## 2 2016-04-02 63367236                104227
## 3 2016-04-03 63394484                104325
## 4 2016-04-04 63425180                104357
## 5 2016-04-05 63457259                104447
## 6 2016-04-06 63485511                104621
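The start date above is hard-coded. A possible alternative, sketched here on made-up toy data, is to derive the overlapping date window from the two series themselves before merging. The data frames `fb` and `lkd` are illustrative stand-ins for the per-company tables.

```r
# Toy series with different date ranges (made-up values).
fb  <- data.frame(time = as.Date("2016-03-30") + 0:4,
                  likes = c(10, 11, 12, 13, 14))
lkd <- data.frame(time = as.Date("2016-04-01") + 0:4,
                  employees_on_platform = c(100, 101, 102, 103, 104))

# The common window starts at the later first date and ends at the earlier last date.
start <- max(min(fb$time), min(lkd$time))
end   <- min(max(fb$time), max(lkd$time))

# Full outer merge on the date, restricted to the common window.
both <- merge(fb[fb$time >= start & fb$time <= end, ],
              lkd[lkd$time >= start & lkd$time <= end, ],
              by = "time", all = TRUE)
print(both)
```

Restricting both series to the shared window keeps the merged table from starting or ending with long runs of missing values in one column.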

It is understandable that after merging the two data sets there might be some missing values. To fill up those missing values we use the nearest value.

# Helper function. This code snippet was taken from
# https://stackoverflow.com/questions/10077415/replacing-nas-in-r-with-nearest-value
filling_na <- function(dat) {
  N <- length(dat)
  na.pos <- which(is.na(dat))
  if (length(na.pos) %in% c(0, N)) {
    return(dat)
  }
  non.na.pos <- which(!is.na(dat))
  intervals  <- findInterval(na.pos, non.na.pos,
                             all.inside = TRUE)
  left.pos   <- non.na.pos[pmax(1, intervals)]
  right.pos  <- non.na.pos[pmin(N, intervals+1)]
  left.dist  <- na.pos - left.pos
  right.dist <- right.pos - na.pos

  dat[na.pos] <- ifelse(left.dist <= right.dist,
                        dat[left.pos], dat[right.pos])
  return(dat)
}


mcd$likes = filling_na(mcd$likes)
mcd$employees_on_platform = filling_na(mcd$employees_on_platform)
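To make the nearest-value rule concrete, here is a shorter and slower sketch of the same idea applied to a toy vector. `fill_nearest` is a hypothetical stand-in written for illustration, not the helper used above; like that helper, it prefers the left neighbour when the two nearest values are equally far away.

```r
# Hypothetical, simplified nearest-value fill (loops over the NAs one by one).
fill_nearest <- function(v) {
  known <- which(!is.na(v))          # positions of the observed values
  if (length(known) == 0) return(v)  # nothing to copy from
  for (i in which(is.na(v))) {
    # which.min returns the first minimum, so ties go to the left neighbour
    v[i] <- v[known[which.min(abs(known - i))]]
  }
  v
}

filled <- fill_nearest(c(NA, 1, NA, NA, 4, NA))
print(filled)  # 1 1 1 4 4 4
```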

Now I will scale the two time series to make them easier to work with. I will also create two time series objects in R to represent them separately. Then, to remove the trend from each series, I will use the difference function and calculate the cross-correlation between the differenced series.

library(xts)
library(zoo)

x = scale(mcd$likes)
y = scale(mcd$employees_on_platform)

x = xts(x, order.by = mcd$time)
y = xts(y, order.by = mcd$time)

Now let’s plot all this information on the same graph.

par(mfrow=c(2,2))
plot(x)
plot(y)
plot(diff(x))
plot(diff(y))
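Differencing matters because two series that both trend upward can look strongly correlated in levels even when they are unrelated. The following sketch illustrates this with two simulated, independent random walks with drift; the series `a` and `b` are made up and have nothing to do with the real data.

```r
set.seed(123)
n <- 500

# Two independent random walks, each with its own upward drift.
a <- cumsum(rnorm(n, mean = 0.5))
b <- cumsum(rnorm(n, mean = 0.3))

cor_levels <- cor(a, b)              # dominated by the shared upward trend
cor_diffs  <- cor(diff(a), diff(b))  # day-to-day changes are independent

print(c(levels = cor_levels, differences = cor_diffs))
```

In this simulation the correlation in levels is large while the correlation of the differences is close to zero, which is why the cross-correlation is computed on the differenced series.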

Now I will plot the cross-correlation between the two time series.

ccf1 = ccf(as.numeric(diff(x)),as.numeric(diff(y)),na.action=na.omit, lag.max =30, plot = FALSE)
plot(ccf1, main = "Cross-correlation")

From this plot, the two time series appear to be highly correlated. So for McDonald’s, social media popularity might be closely linked to the growth of the company, although correlation by itself does not show that one drives the other.
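To go beyond reading the plot, one option is to pull the numbers out of the object returned by `ccf` and compare them with the approximate 95% bounds that the plot draws, which are plus or minus 1.96 divided by the square root of the number of observations. The sketch below uses simulated series in which `dy` follows `dx` with a delay of three steps; the data and variable names are illustrative assumptions.

```r
set.seed(1)
n <- 300

# Simulated differenced series: dy repeats dx three steps later, plus noise.
dx <- rnorm(n)
dy <- c(rep(0, 3), dx[1:(n - 3)]) + rnorm(n, sd = 0.5)

cc <- ccf(dx, dy, lag.max = 30, plot = FALSE)

best     <- which.max(abs(cc$acf))         # position of the strongest correlation
best_lag <- cc$lag[best]                   # lag at which it occurs
best_cor <- cc$acf[best]
bound    <- qnorm(0.975) / sqrt(cc$n.used) # approximate 95% significance bound

print(c(lag = best_lag, correlation = best_cor, bound = bound))
```

A cross-correlation whose absolute value exceeds the bound is significant at roughly the 5% level under the null hypothesis of no correlation. Keep in mind that with 61 lags a few values can cross the bound by chance.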

Next, we run the same kind of analysis on the Starbucks data and get the following results.

starbucks.fb$time = as.Date(starbucks.fb$time)

colnames(starbucks.lkd)[1]='time'
starbucks.lkd$time = as.Date(starbucks.lkd$time)


start = as.Date("2016-09-20")
st.fb = starbucks.fb[starbucks.fb$time >=start, ]
st.lkd = starbucks.lkd[starbucks.lkd$time >=start, ] # filter on the Starbucks dates

# merge two time series:
st = merge(st.fb,st.lkd, by = 'time', all.x = TRUE, all.y = TRUE)

head(st)

##         time likes employees_on_platform
## 1 2016-03-06    NA                 65798
## 2 2016-03-07    NA                 65797
## 3 2016-03-08    NA                 65861
## 4 2016-03-09    NA                 65952
## 5 2016-03-10    NA                 66006
## 6 2016-03-11    NA                 66055

# filling up NAs
st$likes = filling_na(st$likes)
st$employees_on_platform = filling_na(st$employees_on_platform)

x_st = scale(st$likes)
y_st = scale(st$employees_on_platform)

x_st = xts(x_st, order.by = st$time)
y_st = xts(y_st, order.by = st$time)

par(mfrow=c(2,2))
plot(x_st)
plot(y_st)
plot(diff(x_st))
plot(diff(y_st))

par(mfrow= c(1,1))
ccf1_st = ccf(as.numeric(diff(x_st)),as.numeric(diff(y_st)),na.action=na.omit, lag.max =30, plot = FALSE)
plot(ccf1_st, main = "Cross-correlation")

Forecasting

In this part of my analysis, I will use the time series of Facebook likes to forecast the employee growth of a company. First I will create a simple model. Let \(Y_t\) represent the number of employees and \(X_t\) represent the Facebook likes of the company. We can then build a linear model with \(Y_t\) as the target and \(X_t\) as the predictor.

\[ Y_t = \beta_1 X_t + \beta_2 + \epsilon_t \] Here, \(\epsilon_t\) is a random error.

library(latex2exp)
plot(as.numeric(x), as.numeric(y), xlab = TeX("$X_t$"), ylab = TeX("$Y_t$"))

We get the following summary.

# for mcdonald's
fit = lm(y~x) # regress Y_t on X_t
summary(fit)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54337 -0.11348 -0.03133  0.13967  0.41545 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.128e-15  7.246e-03     0.0        1    
## x           9.778e-01  7.250e-03   134.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2098 on 836 degrees of freedom
## Multiple R-squared:  0.9561, Adjusted R-squared:  0.956 
## F-statistic: 1.819e+04 on 1 and 836 DF,  p-value: < 2.2e-16
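Because both series were standardized with `scale` before fitting, this summary has a simple reading: the intercept is zero up to rounding, the slope equals the Pearson correlation between the two series, and R-squared is the square of that slope (here 0.9778 squared is about 0.9561, matching the reported value). The sketch below checks this property on simulated data; the variables `likes` and `employees` are made up for illustration.

```r
set.seed(7)

# Simulated raw series (illustrative only).
likes     <- cumsum(runif(200, 0, 10))
employees <- 0.02 * likes + rnorm(200)

# Standardize both series, as in the analysis above.
xs <- as.numeric(scale(likes))
ys <- as.numeric(scale(employees))

fit_toy   <- lm(ys ~ xs)
intercept <- unname(coef(fit_toy)[1])
slope     <- unname(coef(fit_toy)[2])
r         <- cor(likes, employees)
r_squared <- summary(fit_toy)$r.squared

print(c(intercept = intercept, slope = slope, r = r, r_squared = r_squared))
```

So the regression on standardized levels mostly restates the correlation of the two trending series; it does not add evidence beyond that correlation.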

# for starbucks
fit_st = lm(y_st~x_st) # regress Y_t on X_t
summary(fit_st)
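The fits above are evaluated on the same data they were trained on. One way to judge forecasting ability more honestly is to hold out the most recent part of the series. The following is a minimal sketch on simulated data, where all numbers and variable names are assumptions. Note that it still uses the likes observed on the same day as the predictor, as the model above does, so a true forecast would need lagged likes instead.

```r
set.seed(2019)
n <- 300

# Simulated series: employees depend linearly on likes plus noise.
likes     <- cumsum(rnorm(n, mean = 1))
employees <- 50 + 0.4 * likes + rnorm(n, sd = 2)
d <- data.frame(likes = likes, employees = employees)

# Time-ordered split: fit on the first 80%, evaluate on the last 20%.
train <- d[1:240, ]
test  <- d[241:n, ]

fit_holdout <- lm(employees ~ likes, data = train)
pred <- predict(fit_holdout, newdata = test)

# Compare against a naive baseline that repeats the last training value.
rmse       <- sqrt(mean((test$employees - pred)^2))
naive_rmse <- sqrt(mean((test$employees - tail(train$employees, 1))^2))

print(c(model = rmse, naive = naive_rmse))
```

The split is by time rather than at random, because shuffling a time series would let the model see the future during training.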

Further Analysis

Here are some future plans for this project.

We can run the same type of analysis for other companies (for example, small and medium-sized companies). We might also get different results for different types of company. For example, companies related to food or clothing consumption might show a strong correlation between their employee growth and their social media popularity, whereas the growth of banking and financial companies may not be highly correlated with it. Since the LinkedIn data set includes the industry of each company, I can create a new model to test whether this assumption is statistically significant.
Test for Granger causality between the two time series, if there is any.
Use neural networks to train a model on the whole data set, which covers many companies.
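As a rough illustration of the Granger causality idea mentioned above, the test with one lag can be written in base R as a comparison of two nested regressions: one that predicts the target from its own past only, and one that also includes the past of the other series. The sketch uses simulated data in which `x` really does help predict `y`; the series and the lag order are assumptions. The `lmtest` package provides a `grangertest` function for the same purpose.

```r
set.seed(42)
n <- 200

# Simulated series: y depends on its own past and on the past of x.
x <- rnorm(n)
y <- numeric(n)
for (t in 2:n) {
  y[t] <- 0.5 * y[t - 1] + 0.8 * x[t - 1] + rnorm(1, sd = 0.5)
}

# Align each value of y with the previous values of y and x.
d <- data.frame(y = y[-1], y_lag = y[-n], x_lag = x[-n])

restricted   <- lm(y ~ y_lag, data = d)          # own past only
unrestricted <- lm(y ~ y_lag + x_lag, data = d)  # own past plus past of x

# F-test: does adding the lagged x significantly improve the fit?
granger <- anova(restricted, unrestricted)
p_value <- granger[["Pr(>F)"]][2]
print(p_value)
```

A small p-value suggests that past values of `x` carry information about `y` beyond what the past of `y` already provides. For the real data, the test would be run on the differenced series and in both directions.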

Data Incubator Project

Shouman Das

February 20, 2019