To measure the growth of a company, we can use various business and market-related metrics. One interesting question is whether we can analyse and estimate the growth of a company from the number of its LinkedIn profiles and followers. I will use two data sets provided by Thinknum (they can be found at the Data Incubator blog). The first tracks the number of employees on the LinkedIn platform and provides real-time insight into the growth of the companies. Another interesting question is whether social media popularity can be a significant indicator of a company’s growth performance. For this purpose, I am also going to use the Facebook follower data of the same companies and check whether a company’s employee growth is strongly correlated with its social media popularity. Basically, I am interested in answering the following questions:
1. If there is any correlation, is it statistically significant?
2. How can social media popularity be used as a metric for the growth of a company?
3. Can social media activity be used to forecast a company’s growth? If so, which types of companies have growth most affected by variation in social media popularity?
First we load the two data sets and some useful packages. This takes a little time, as the data sets are big.
library(dplyr)
library(ggplot2)
library(gridExtra)
linkedin = read.csv("temp_datalab_records_linkedin_company.csv")
facebook = read.csv("temp_datalab_records_social_facebook.csv")
Next we do a simple summary of the interesting columns of the two data sets.
summary(linkedin[,c(3,4,5,7)])
## company_name followers_count employees_on_platform
## City National Bank: 1605 Min. : 0 Min. : 0
## American Airlines : 1029 1st Qu.: 2148 1st Qu.: 218
## Apple : 1025 Median : 9335 Median : 1083
## Activision : 1024 Mean : 71677 Mean : 7587
## Amgen : 1024 3rd Qu.: 38642 3rd Qu.: 4513
## Cisco : 1024 Max. :7833967 Max. :577952
## (Other) :2419465
## industry
## Banking : 168364
## Biotechnology : 152710
## Financial Services: 148143
## Oil & Energy : 116830
## Retail : 95384
## Pharmaceuticals : 92107
## (Other) :1652658
summary(facebook[,c(3,4,7,8,9)])
## username checkins likes
## : 120929 Min. : 0 Min. : 1
## 2u : 1222 1st Qu.: 0 1st Qu.: 2500
## aflacduck : 1222 Median : 13 Median : 20477
## ModelNInc : 1222 Mean : 14170 Mean : 816625
## RedRobin : 1222 3rd Qu.: 286 3rd Qu.: 217579
## shutterfly: 1222 Max. :17290550 Max. :210641077
## (Other) :3494352
## talking_about_count facebook_id
## Min. : 0 Min. :5.182e+09
## 1st Qu.: 27 1st Qu.:9.448e+10
## Median : 251 Median :1.123e+14
## Mean : 10043 Mean :1.738e+14
## 3rd Qu.: 2474 3rd Qu.:1.941e+14
## Max. :5747010 Max. :1.015e+16
##
To start with some exploratory plots, I pick some well-known, established companies that appear in both data sets and plot their time series. For example, I first pick “McDonald’s” and plot its number of Facebook likes and its number of employees on LinkedIn.
mcdonalds.lkd = linkedin[linkedin$company_name=="McDonald's",c(2,5)]
mcdonalds.fb = facebook[facebook$username=="McDonalds",c(2,7)]
mcdonalds.lkd$as_of_date = as.POSIXct(mcdonalds.lkd$as_of_date)
mcdonalds.fb$time = as.POSIXct(mcdonalds.fb$time)
# plotting
m.lkd = ggplot(mcdonalds.lkd, aes(x = as_of_date, y = employees_on_platform)) +
geom_line(size = 1, color = "blue") + xlab("Time")+ylab("McDonald's Employees")
m.fb = ggplot(mcdonalds.fb, aes(x = time, y = likes)) +
geom_line(size = 1, color = "blue") + xlab("Time")+ylab("McDonald's Likes")
grid.arrange(m.lkd, m.fb, nrow = 2)
For McDonald’s, employee growth and social media growth look highly correlated.
For the next plot we pick another well-known company, “Starbucks”, which has a high talking-about count on social media. We draw a similar plot.
starbucks.lkd = linkedin[linkedin$company_name=="Starbucks",c(2,5)]
starbucks.fb = facebook[facebook$username=="Starbucks",c(2,7)]
starbucks.lkd$as_of_date = as.POSIXct(starbucks.lkd$as_of_date)
starbucks.fb$time = as.POSIXct(starbucks.fb$time)
# plotting
s.lkd = ggplot(starbucks.lkd, aes(x = as_of_date, y = employees_on_platform)) +
geom_line(size = 1, color = "blue") + xlab("Time")+ylab("Starbucks Employees")
s.fb = ggplot(starbucks.fb, aes(x = time, y = likes)) +
geom_line(size = 1, color = "blue") + xlab("Time")+ylab("Starbucks Likes")
grid.arrange(s.lkd, s.fb, nrow = 2)
From the plot, we can infer that at the beginning of 2018 Starbucks’ social media popularity did not keep pace with its number of employees. There might be some unobserved variables that badly affected its social media presence.
It would be interesting to see how the number of employees and social media popularity are correlated for small to medium-sized companies.
Both McDonald’s and Starbucks are food-related companies. We will draw the same kind of plot for some other big companies. For example, let’s take tech companies like Microsoft and Google and inspect similar plots of their Facebook likes and employee numbers.
# For Microsoft's likes/employee count
microsoft.lkd = linkedin[linkedin$company_name=="Microsoft",c(2,5)]
microsoft.fb = facebook[facebook$username=="Microsoft",c(2,7)]
microsoft.lkd$as_of_date = as.POSIXct(microsoft.lkd$as_of_date)
microsoft.fb$time = as.POSIXct(microsoft.fb$time)
# plotting
m.lkd = ggplot(microsoft.lkd, aes(x = as_of_date, y = employees_on_platform)) +
geom_line(size = 1, color = "blue") + xlab("Time")+ylab("Microsoft's Employees")
m.fb = ggplot(microsoft.fb, aes(x = time, y = likes)) +
geom_line(size = 1, color = "blue") + xlab("Time")+ylab("Microsoft's Likes")
grid.arrange(m.lkd, m.fb, nrow = 2)
# For Google's likes/employee count
google.lkd = linkedin[linkedin$company_name=="Google",c(2,5)]
google.fb = facebook[facebook$username=="Google",c(2,7)]
google.lkd$as_of_date = as.POSIXct(google.lkd$as_of_date)
google.fb$time = as.POSIXct(google.fb$time)
# plotting
m.lkd = ggplot(google.lkd, aes(x = as_of_date, y = employees_on_platform)) +
geom_line(size = 1, color = "blue") + xlab("Time")+ylab("Google's Employees")
m.fb = ggplot(google.fb, aes(x = time, y = likes)) +
geom_line(size = 1, color = "blue") + xlab("Time")+ylab("Google's Likes")
grid.arrange(m.lkd, m.fb, nrow = 2)
For both of these companies we can see that there is a similar kind of uptrend in both the number of likes and the number of employees.
First, note that these data are time series. Since I am working with two data sets, for each company I have two time series: one for the number of likes and the other for the number of employees. The timestamps of the two series differ, as we can see from the exploratory analysis. So the first thing I will do is merge the two time series from a suitable starting point, and then use the likes count and employee count to find out whether there is any correlation.
Also, the two data sets may not contain the same set of companies. So for my preliminary analysis I have handpicked a few well-known companies and run some simple analysis for each of them.
mcdonalds.fb$time = as.Date(mcdonalds.fb$time)
colnames(mcdonalds.lkd)[1]='time'
mcdonalds.lkd$time = as.Date(mcdonalds.lkd$time)
start = as.Date("2016-04-01")
mcd.fb = mcdonalds.fb[mcdonalds.fb$time >=start, ]
mcd.lkd = mcdonalds.lkd[mcdonalds.lkd$time >=start, ]
# merge two time series:
mcd = merge(mcd.fb,mcd.lkd, by = 'time', all.x = TRUE, all.y = TRUE)
head(mcd)
## time likes employees_on_platform
## 1 2016-04-01 63337311 104180
## 2 2016-04-02 63367236 104227
## 3 2016-04-03 63394484 104325
## 4 2016-04-04 63425180 104357
## 5 2016-04-05 63457259 104447
## 6 2016-04-06 63485511 104621
After merging the two data sets, some dates may have missing values, because the two series were not recorded at exactly the same times. To fill in those missing values we use the nearest observed value.
# Helper function. This code snippet was taken from
# https://stackoverflow.com/questions/10077415/replacing-nas-in-r-with-nearest-value
filling_na <- function(dat) {
N <- length(dat)
na.pos <- which(is.na(dat))
if (length(na.pos) %in% c(0, N)) {
return(dat)
}
non.na.pos <- which(!is.na(dat))
intervals <- findInterval(na.pos, non.na.pos,
all.inside = TRUE)
left.pos <- non.na.pos[pmax(1, intervals)]
right.pos <- non.na.pos[pmin(N, intervals+1)]
left.dist <- na.pos - left.pos
right.dist <- right.pos - na.pos
dat[na.pos] <- ifelse(left.dist <= right.dist,
dat[left.pos], dat[right.pos])
return(dat)
}
mcd$likes = filling_na(mcd$likes)
mcd$employees_on_platform = filling_na(mcd$employees_on_platform)
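To see why the outer merge creates gaps and what nearest-value filling does with them, here is a minimal sketch on made-up data. The data frames likes and staff, their values, and the helper fill_nearest are illustrative assumptions and not part of the Thinknum data; fill_nearest is only a compact stand-in for filling_na, with ties going to the earlier observation.

```r
# Two toy daily series observed on different dates (made-up values).
likes <- data.frame(
  time  = as.Date(c("2016-04-01", "2016-04-02", "2016-04-04")),
  likes = c(100, 110, 130)
)
staff <- data.frame(
  time      = as.Date(c("2016-04-02", "2016-04-03", "2016-04-04")),
  employees = c(50, 51, 52)
)

# all = TRUE keeps every date from both sides (a full outer join),
# so a date present in only one series gets NA in the other column.
toy <- merge(likes, staff, by = "time", all = TRUE)
print(toy)

# Hypothetical helper: replace each NA with the closest non-NA value by position.
# which.min() returns the first minimum, so ties go to the earlier observation.
fill_nearest <- function(v) {
  ok <- which(!is.na(v))
  if (length(ok) == 0) return(v)
  for (i in which(is.na(v))) {
    v[i] <- v[ok[which.min(abs(ok - i))]]
  }
  v
}

toy$likes     <- fill_nearest(toy$likes)
toy$employees <- fill_nearest(toy$employees)
print(toy)
```

Nearest-value filling keeps the series at an observed level instead of interpolating; whether that is appropriate depends on how long the gaps are.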
Now I will scale the two time series so that they are easier to compare in further computation. I will also create two time series objects in R to represent them separately. Then, to remove the trend, I will difference each series and calculate the cross-correlation between the differenced series.
library(xts)
library(zoo)
x = scale(mcd$likes)
y = scale(mcd$employees_on_platform)
x = xts(x, order.by = mcd$time)
y = xts(y, order.by = mcd$time)
Now let’s plot all this information on the same graph.
par(mfrow=c(2,2))
plot(x)
plot(y)
plot(diff(x))
plot(diff(y))
Now I will plot the cross-correlation between the two time series.
par(mfrow = c(1,1)) # reset the 2x2 layout used for the previous plots
ccf1 = ccf(as.numeric(diff(x)),as.numeric(diff(y)),na.action=na.omit, lag.max =30, plot = FALSE)
plot(ccf1, main = "Cross-correlation")
From this plot, it can be said that the two differenced series are highly correlated. So for McDonald’s, social media popularity might be closely related to the growth of the company, although correlation alone does not show that one drives the other.
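The reason for differencing before computing the cross-correlation can be illustrated with a small simulation. This is only a sketch on synthetic data: the series a and b, the drift of 0.5, and the seed are assumptions chosen for illustration. Two independent random walks that both drift upward look strongly correlated in levels, while their differences do not.

```r
set.seed(1)
n <- 500

# Two independent random walks with drift: both trend upward,
# but neither series influences the other.
a <- cumsum(0.5 + rnorm(n))
b <- cumsum(0.5 + rnorm(n))

cor_levels <- cor(a, b)              # dominated by the shared upward trend
cor_diffs  <- cor(diff(a), diff(b))  # correlation of the step-to-step changes

# Cross-correlation of the differenced series, as in the analysis above.
cc <- ccf(diff(a), diff(b), lag.max = 30, plot = FALSE)

# Approximate 95% band under the hypothesis of no correlation.
band <- 1.96 / sqrt(cc$n.used)

cat("correlation in levels:     ", round(cor_levels, 3), "\n")
cat("correlation in differences:", round(cor_diffs, 3), "\n")
cat("lags outside the band:     ", sum(abs(cc$acf) > band), "of", length(cc$acf), "\n")
```

In R, the lag k value returned by ccf(x, y) estimates the correlation between x[t+k] and y[t], so the sign of the lag at which a peak occurs indicates which series tends to move first. That is the quantity of interest for the forecasting question.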
Next, we do a similar analysis on the Starbucks data and get the following results.
starbucks.fb$time = as.Date(starbucks.fb$time)
colnames(starbucks.lkd)[1]='time'
starbucks.lkd$time = as.Date(starbucks.lkd$time)
start = as.Date("2016-09-20")
st.fb = starbucks.fb[starbucks.fb$time >=start, ]
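# filter on the Starbucks dates; indexing with mcdonalds.lkd$time would misalign the rows and let dates before `start` through
st.lkd = starbucks.lkd[starbucks.lkd$time >=start, ]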
# merge two time series:
st = merge(st.fb,st.lkd, by = 'time', all.x = TRUE, all.y = TRUE)
head(st)
## time likes employees_on_platform
## 1 2016-03-06 NA 65798
## 2 2016-03-07 NA 65797
## 3 2016-03-08 NA 65861
## 4 2016-03-09 NA 65952
## 5 2016-03-10 NA 66006
## 6 2016-03-11 NA 66055
# filling up NAs
st$likes = filling_na(st$likes)
st$employees_on_platform = filling_na(st$employees_on_platform)
x_st = scale(st$likes)
y_st = scale(st$employees_on_platform)
x_st = xts(x_st, order.by = st$time)
y_st = xts(y_st, order.by = st$time)
par(mfrow=c(2,2))
plot(x_st)
plot(y_st)
plot(diff(x_st))
plot(diff(y_st))
par(mfrow= c(1,1))
ccf1_st = ccf(as.numeric(diff(x_st)),as.numeric(diff(y_st)),na.action=na.omit, lag.max =30, plot = FALSE)
plot(ccf1_st, main = "Cross-correlation")
In this part of my analysis I will use the time series of Facebook likes to forecast the employee growth of a company. First I will create a simple model. Let \(Y_t\) represent the number of employees and \(X_t\) represent the Facebook likes of the company. We can then fit a linear model with \(Y_t\) as the target and \(X_t\) as the predictor.
\[ Y_t = \beta_1 X_t + \beta_2 + \epsilon_t \] Here, \(\epsilon_t\) is a random error.
library(latex2exp)
plot(as.numeric(x), as.numeric(y), xlab = TeX("$X_t$"), ylab = TeX("$Y_t$"))
We get the following summary.
# for mcdonald's
fit = lm(y~x) # regress Y_t on X_t
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54337 -0.11348 -0.03133 0.13967 0.41545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.128e-15 7.246e-03 0.0 1
## x 9.778e-01 7.250e-03 134.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2098 on 836 degrees of freedom
## Multiple R-squared: 0.9561, Adjusted R-squared: 0.956
## F-statistic: 1.819e+04 on 1 and 836 DF, p-value: < 2.2e-16
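Because x and y were standardised with scale() before fitting, the summary above has a simple reading: in a one-predictor regression on standardised variables the intercept is zero, the slope equals the Pearson correlation, and R-squared is the square of the slope. Indeed, 0.9778^2 is approximately 0.9561, which matches the reported Multiple R-squared. The sketch below checks this identity on synthetic data; likes_raw, staff_raw, and their generating formula are made-up assumptions.

```r
set.seed(42)
n <- 200

# Synthetic, made-up series standing in for likes and employee counts.
likes_raw <- cumsum(runif(n, 0, 10))
staff_raw <- 3 + 0.02 * likes_raw + rnorm(n, sd = 0.5)

# Standardise both series to mean 0 and standard deviation 1.
xs <- as.numeric(scale(likes_raw))
ys <- as.numeric(scale(staff_raw))

toy_fit   <- lm(ys ~ xs)
intercept <- unname(coef(toy_fit)["(Intercept)"])
slope     <- unname(coef(toy_fit)["xs"])
r         <- cor(likes_raw, staff_raw)
r_squared <- summary(toy_fit)$r.squared

cat("slope:", slope, " correlation:", r, "\n")
cat("R-squared:", r_squared, " slope^2:", slope^2, "\n")
```

Since this regression is fitted on the levels of two trending series, a high R-squared mostly reflects their shared trend; it does not by itself say how well likes would forecast future employee counts.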
# for starbucks
fit_st = lm(y_st ~ x_st) # regress the scaled Starbucks employee series on the scaled likes series
summary(fit_st)
Here are some future plans for this project.
We can run the same type of analysis for other companies (any small to medium-sized companies). Different types of company might also give different results. For example, companies related to food or clothing consumption might show a strong correlation between employee growth and social media popularity, whereas the growth of banking or financial companies may not be highly correlated with it. Since the LinkedIn data set includes the industry of each company, I can create a new model that tests whether this assumption is statistically significant.
Test for Granger causality between the two time series, if there is any.
Use neural networks to train a model on the whole data set, which covers many companies.
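As a rough idea of what the Granger causality step could look like, here is a minimal base R sketch on synthetic data. The series dx and dy, the coefficient 0.6, and the lag order p = 2 are all assumptions made for illustration. The test compares a model of dy on its own lags with a model that also includes lags of dx, using an F-test on the added terms.

```r
set.seed(7)
n <- 300

# Synthetic differenced series: dy depends on the previous value of dx.
dx <- rnorm(n)
dy <- numeric(n)
for (i in 2:n) {
  dy[i] <- 0.6 * dx[i - 1] + rnorm(1, sd = 0.5)
}

p <- 2  # number of lags; an arbitrary choice for this sketch

# embed() puts the current value in column 1 and lags 1..p in the next columns.
Ey <- embed(dy, p + 1)
Ex <- embed(dx, p + 1)
y_now  <- Ey[, 1]
y_lags <- Ey[, -1, drop = FALSE]
x_lags <- Ex[, -1, drop = FALSE]

restricted   <- lm(y_now ~ y_lags)           # own history only
unrestricted <- lm(y_now ~ y_lags + x_lags)  # own history plus lagged dx

# F-test for whether the lagged dx terms improve the fit.
g_test  <- anova(restricted, unrestricted)
p_value <- g_test[["Pr(>F)"]][2]
cat("p-value for 'dx Granger-causes dy':", p_value, "\n")
```

With the real data, dx and dy would be the differenced likes and employee series, and the test would be run in both directions. The lmtest package provides grangertest() as a packaged alternative to this hand-rolled comparison.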