—————————————————————————————————————————————————————————————
We scraped Twitter for the number of times various data science related skills were mentioned, in four scrapes across three different dates: March 16, 19 and 20, 2016.
We have also scraped the contents of a set of published articles for the number of mentions of these same skills, which we'll discuss in part two, followed by a discussion of the term-document matrix (TDM) in part three. The data aggregation is done separately for each part, but the analysis of the three parts will be done jointly.
library(plyr)
library(RColorBrewer)
library(ggplot2)
library(plotly) # masks last_plot() from ggplot2 and layout() from graphics
library(wordcloud) # used below for the word cloud of article mentions
library(knitr) # kable(), used below for printing tables
We’ll load in the Twitter frequency data from our online repository; since the first column is merely line numbers, we can excise that.
twitter.url <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/frequency_results.csv") # Remote CSV of Twitter mention counts
twitter <- read.csv(twitter.url, stringsAsFactors = FALSE, sep = ",") # Read the counts into R
twitter <- twitter[, 2:4] # Drop the first column (row numbers), keeping skill_id, t_freq and dates
View(twitter)
kable(head(twitter))
| skill_id | t_freq | dates |
|---|---|---|
| 1 | 0 | 2016-03-16 |
| 2 | 0 | 2016-03-16 |
| 3 | 0 | 2016-03-16 |
| 4 | 0 | 2016-03-16 |
| 5 | 5 | 2016-03-16 |
| 6 | 0 | 2016-03-16 |
kable(tail(twitter))
|   | skill_id | t_freq | dates |
|---|---|---|---|
| 591 | 144 | 0 | 2016-03-20 |
| 592 | 145 | 0 | 2016-03-20 |
| 593 | 146 | 0 | 2016-03-20 |
| 594 | 147 | 0 | 2016-03-20 |
| 595 | 148 | 0 | 2016-03-20 |
| 596 | 149 | 0 | 2016-03-20 |
The skills are simply numbered from 1 onwards - we see that there are 149 skills.
Looking at the tables above, we see that we have multiple dates' worth of data. Since these dates are only a few days apart, it's not worth analyzing any temporal trend - e.g., how certain skills have become more or less popular over time. If we had gathered data years apart, that would have been a more fruitful exercise.
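Still, if we did want a quick per-date view, a simple cross-tabulation of the table we just loaded gives one. This is a minimal sketch, not part of the main analysis; xtabs() sums any duplicate date rows for a skill.
twitByDate <- xtabs(t_freq ~ skill_id + dates, data = twitter) # mentions per skill and date
head(twitByDate)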
Let's combine that data into one table of 149 rows, along with the respective skill titles. Our skill titles are stored in a separate table.
First, we subset the data table into its respective portions, and load the skill titles table.
twitPart1 <- subset(twitter, dates == "2016-03-16")
twitPart2 <- subset(twitter, dates == "2016-03-19")[1:149, ] # March 19 was scraped twice...
twitPart3 <- subset(twitter, dates == "2016-03-19")[150:298, ] # ...so its 298 rows are split into the two scrapes
twitPart4 <- subset(twitter, dates == "2016-03-20")
skillTitle.url <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/skillsAsher.csv") # The URL where the file listing the skill titles is located.
skillTitle <- read.csv(skillTitle.url, stringsAsFactors = FALSE) # Reads the skill titles into R.
twitAllDates <- data.frame(skill_id = twitPart1$skill_id,
                           t_freq = twitPart1$t_freq + twitPart2$t_freq + twitPart3$t_freq + twitPart4$t_freq, # Sum the mentions across all four scrapes
                           skill_title = skillTitle$skill_name,
                           stringsAsFactors = FALSE)
Looking at the data, we see that a number of skills don’t get mentioned at all. How many?
zeroTwits <- subset(twitAllDates, twitAllDates$t_freq == 0) # The subset of skills with zero mentions
nrow(twitAllDates) #The total number of skills
## [1] 149
nrow(zeroTwits) #The number of skills with zero mentions
## [1] 84
nrow(zeroTwits)/nrow(twitAllDates) #The proportion of skills with zero mentions
## [1] 0.5637584
About 56% of the skills we searched for were never mentioned. To reduce the chance of leaving out an important skill, we deliberately cast a wide net and included plenty of skills that are not commonly talked about, so it is no cause for concern that so many of them garnered no mentions. You could say our search was sensitive, but not specific :).
However, it's important to note that where there is a significant cost associated with gathering data, one must be more judicious about selecting what data to gather - you can't simply dream up a Christmas wish list of variables and ask for it all.
From here, we’ll limit our investigation to the skills with positive frequencies, hereafter twitPositive.
twitPositive <- subset(twitAllDates, twitAllDates$t_freq > 0)
twitSort <- twitPositive[order(-twitPositive$t_freq), ] #Sort results by frequency, descending
View(twitSort)
We'll have to do some minor cleaning - we have both 'machinelearning' and 'machine learning' in our dataset, and they both rank very high. We'll combine them into one entry named 'ML', since the full title is long and would make plotting in base R more troublesome.
Also, one of our top results has a title that is too long, ‘predictive analytics’ - that will be shortened to ‘pred. analysis.’
MLRowNum <- which(twitSort$skill_title == "machinelearning") #The row number of machinelearning
M_LRowNum <- which(twitSort$skill_title == "machine learning")#The row number of machine learning (i.e. with a space in between)
twitSort$skill_title[MLRowNum] <- "ML" #Renames machinelearning to ML
twitSort$t_freq[MLRowNum] <- twitSort$t_freq[MLRowNum] + twitSort$t_freq[M_LRowNum] #Sums the two ML frequencies together
twitSort <- twitSort[-M_LRowNum, ] #Deletes the duplicate row
#which(twitSort$skill_title == "machine learning")
PARowNum <- which(twitSort$skill_title == "predictive analytics")
twitSort$skill_title[PARowNum] <- "pred. analysis"
View(twitSort)
Before we begin to visualize and understand the data, we’ll do all the previous steps again for the mentions we gathered from published articles. Then we’ll compare the data from each source side by side.
We’ll replicate the process we did with Twitter for our dataset of mentions in the press. Our sample is 91 published articles. We checked each article against a similar list of skills, to count the number of mentions.
Our datasets were formatted slightly differently because different teams were involved and no standards were hashed out beforehand. That isn't a big deal here, since the data is simple and this project is a one-off; but if this were a routine activity, we'd want the datasets formatted identically, so we'd spend some time standardizing the output format before gathering any data.
articleURL <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/Build-URL_DataFrame-Output.csv")
articleData <- read.csv(articleURL, stringsAsFactors = FALSE, sep = ",")
View(head(articleData))
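As an aside, the formatting mismatch noted above is easy to see by comparing the column names of the two tables. (A trivial check; the comments below mention only the columns this analysis relies on - the article table may contain others.)
names(twitter) # skill_id, t_freq, dates
names(articleData) # includes skill_name and ds_freq, which we use next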
We'll want to aggregate the per-article mention counts into a grand total for each skill across all articles. We'll use the aggregate function.
articleAgg <- aggregate(articleData$ds_freq, by=list(Category= articleData$skill_name), FUN=sum)
names(articleAgg) <- c("skill", "frequency")
View(articleAgg)
With a simple function, we’ve consolidated our ~14,000 lines of data into 149. We can winnow this data down further, by removing the skills that did not garner a single mention.
articlePositive <- subset(articleAgg, articleAgg$frequency > 0)
nrow(articlePositive)
## [1] 115
View(articlePositive)
Now we have 115 skills with at least one mention in the articles we studied. 34 skills out of 149, or 23%, were not mentioned once.
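For comparison with the Twitter data, we can reproduce the zero-mention check here - a small sketch using the objects defined above.
nrow(articleAgg) - nrow(articlePositive) # Skills with zero article mentions: 34
(nrow(articleAgg) - nrow(articlePositive)) / nrow(articleAgg) # The proportion never mentioned, about 0.23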
Now, we can start looking at the most frequently mentioned skills. We’ll sort the skills according to their number of mentions, in descending order:
articleSort <- articlePositive[order(-articlePositive$frequency), ] #Sort results by frequency, descending
kable(head(articleSort))
|   | skill | frequency |
|---|---|---|
| 14 | big data | 704 |
| 132 | Statistics | 359 |
| 111 | R | 323 |
| 75 | Machine Learning | 297 |
| 54 | Hadoop | 272 |
| 109 | programming | 246 |
The output of the term-document matrix (TDM) was posted on GitHub, from which we will retrieve it. It holds the counts of each word in each article; we will aggregate these counts into a sum for each word across all articles, and sort them in descending order of frequency.
tdmURL <- "https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/term-document-matrix/tdm-df"
tdmData <- read.csv(url(tdmURL), stringsAsFactors = FALSE, sep = ",")
names(tdmData) <- c("article", "term", "freq")
tdmAgg <- aggregate(tdmData$freq, by=list(Category= tdmData$term), FUN=sum)
tdmSort <- tdmAgg[order(-tdmAgg$x), ] #Sort results by frequency, descending
tdmSort$rank <- seq_len(nrow(tdmSort)) # Rank each term by overall frequency
We’ll bring back our Twitter data here, so that we can look at the results of all three datasets together.
We’ll consider all three sets of results in barplot form, and use the rainbow() function to distinguish one bar from another.
barplot(twitSort$t_freq, main = "Twitter Mentions of Data Science Skills", xlab = "# of Mentions", ylab = "skills", horiz = TRUE, col = rainbow(nrow(twitSort)))
barplot(articleSort$frequency, main = "Press Mentions of Data Science Skills", xlab = "# of Mentions", ylab = "skills", horiz =TRUE, col = rainbow(nrow(articleSort)))
barplot(tdmSort$x, main = "Most Used Words in the Press, via TDM", xlab = "# of Mentions", ylab = "skills", horiz =TRUE, col = rainbow(nrow(tdmSort)))
While this is useful for getting a sense of the distribution of skills, there are simply too many skills to put in a single graph. But these graphs do convey that some skills get many more mentions than others - even after we’ve removed the skills with zero mentions.
Let's take a look at the top 15 skills:
kable(twitSort[1:15, 3])
| skill_title |
|---|
| ML |
| c |
| big data |
| python |
| statistics |
| research |
| infographic |
| innovation |
| pandas |
| spark |
| hadoop |
| visualization |
| programming |
| regression |
| pred. analysis |
kable(articleSort[1:15, 1])
| skill |
|---|
| big data |
| Statistics |
| R |
| Machine Learning |
| Hadoop |
| programming |
| Python |
| Visualization |
| Data Mining |
| Research |
| SQL |
| Java |
| communication |
| C++ |
| C |
top15 <- articleSort[1:15, ] # The 15 most-mentioned skills in the articles
p <- qplot(skill, frequency, data = top15, color = skill) # Plot each skill's mention count
p + theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate the x-axis labels so the skill names stay legible
set.seed(1234) # Fix the random seed so the word cloud layout is reproducible
wordcloud(words = articleSort$skill, freq = articleSort$frequency, rot.per=0.45, colors=brewer.pal(8, "Dark2")) # Word cloud of article mentions, sized by frequency
Below follow the 25 most frequent terms in the articles searched.
kable(tdmSort[1:25, ])
|   | Category | x | rank |
|---|---|---|---|
| 245 | data | 5291 | 1 |
| 364 | function | 1900 | 2 |
| 776 | s | 1479 | 3 |
| 786 | science | 1374 | 4 |
| 1036 | var | 1319 | 5 |
| 257 | data science | 1107 | 6 |
| 834 | skills | 1069 | 7 |
| 1001 | true | 904 | 8 |
| 793 | scientist | 877 | 9 |
| 258 | data scientist | 791 | 10 |
| 610 | new | 776 | 11 |
| 332 | false | 720 | 12 |
| 288 | document | 668 | 13 |
| 167 | business | 657 | 14 |
| 796 | scientists | 650 | 15 |
| 157 | big | 648 | 16 |
| 725 | px | 618 | 17 |
| 259 | data scientists | 594 | 18 |
| 296 | e | 574 | 19 |
| 45 | a data | 567 | 20 |
| 1071 | window | 561 | 21 |
| 158 | big data | 554 | 22 |
| 636 | of the | 554 | 23 |
| 36 | var | 531 | 24 |
| 91 | analytics | 531 | 25 |
Below is a list of data science skills parsed from within the top 200 mentions of the TDM results:
kable(tdmSort[c(22, 25, 51, 55, 74, 76, 78, 88, 104, 105, 112, 128, 137, 139, 163), ])
|   | Category | x | rank |
|---|---|---|---|
| 158 | big data | 554 | 22 |
| 91 | analytics | 531 | 25 |
| 873 | statistics | 329 | 51 |
| 86 | analysis | 288 | 55 |
| 714 | programming | 233 | 74 |
| 403 | hadoop | 229 | 76 |
| 566 | management | 218 | 78 |
| 871 | statistical | 199 | 88 |
| 726 | python | 183 | 104 |
| 325 | experience | 181 | 105 |
| 588 | mining | 175 | 112 |
| 537 |  | 158 | 128 |
| 256 | data mining | 151 | 137 |
| 1046 | visualization | 150 | 139 |
| 198 | cloud | 136 | 163 |
In barplot form, the top 15:
tdm15 <- rev(c(22, 25, 51, 55, 74, 76, 78, 88, 104, 105, 112, 128, 137, 139, 163)) # Positions of the skill terms above, reversed so the most frequent plots at the top
op <- par(mar = c(4.2, 6.3, 3, 2) + 0.1) # Widen the left margin to make room for the term labels
barplot(tdmSort$x[tdm15], main = "The 15 Most Mentioned Data Science Terms, via TDM", xlab = "# of Mentions", names.arg = tdmSort$Category[tdm15], las = 2, col = rainbow(15), cex.names = 0.8, horiz = TRUE)
par(op) # Restore the previous graphical parameters
The raw results also picked up some JavaScript tokens (e.g. var, function, window), but those have been excluded from the selection above.
We see that big data is a clear winner, especially in the press results. But big data has come to be a byword, even a synonym, for data science. Indeed, one of the main differences between data science and conventional statistics is that data science handles 'big' datasets generated by software, whereas statistics tends to focus on smaller or simpler samples - or refers to the specific mathematical techniques used to analyze the data, rather than the earlier processes of gathering and tidying it.
Big data has also become a buzzword in its own right - one of dubious distinction, critics say. The work it describes spans the fields of applied math and computer science, so it can't be lumped into a single bigger field like statistics or computer science. Regardless, it's safe to conclude that to be a data scientist, one must be comfortable with big data.
Next, at #2 in the press results, we have statistics. Statistics is a core competency of data science - as critical as reflexes are to a racecar driver.
In the Twitter data, the results are fairly consistent from day to day, with machine learning coming out on top; the other skills are closer to one another, so their relative frequency fluctuates a bit.
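A quick way to check that consistency is to pull out the top skill for each scrape date. The sketch below reuses the raw twitter and skillTitle tables, and assumes (as we did when building twitAllDates) that the rows of skillTitle line up with skill_id 1 through 149; note it uses the raw titles, before the 'ML' merge above.
for (d in unique(twitter$dates)) {
  daily <- aggregate(t_freq ~ skill_id, data = subset(twitter, dates == d), FUN = sum) # combines the two March 19 scrapes
  top <- daily[which.max(daily$t_freq), ]
  cat(d, "-", skillTitle$skill_name[top$skill_id], "with", top$t_freq, "mentions\n")
}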
We could go on and classify each skill individually, but looking at these skills, patterns emerge; we can put each of them into one of three categories: areas of competency, tools, and character traits.
Areas of competency are the general subjects and activities that transcend any specific piece of software or field, and are important or essential to being a good data scientist. These are things like numeracy, literacy, and fluency in logic and visual aids. These skills are timeless, at least as long as the data scientists are human. Most of these are general, but some are more specific, like machine learning and regression.
It is possible, however, that some of these skills may wax and wane in importance as the tools for doing data science improve. Perhaps you manage to set up a data collection system where little to no data tidying is needed; or some advanced form of artificial intelligence makes creating compelling charts much easier. Certainly, the relative value of these competencies depends on the subject of your work.
The second category is tools: the specific pieces of software data scientists use to do their jobs - and this is where you'll see the most change over time, in what is and isn't fashionable and in demand. Foremost among these are R, Python, Hadoop, SQL and other relational databases. Change is especially swift in areas where software can be changed easily - where organizations aren't tied up using legacy hardware and software.
Still, some of these tools are staples of the data scientist, and will be for some time. R and Python are two obvious candidates. And even when such tools are replaced, facility with the old ones will help in understanding the new ones that replace them. If, say, some other database software came to the fore, experience with SQL would ease the transition to the new program.
The third category is character traits. Different occupations call for different traits: you hope that your surgeon is careful enough not to amputate the wrong leg, and that your favorite chef has good hygiene. Data scientists, too, have certain preferred qualities.
Calling these items ‘traits’ can make them sound hard to change - but they can all be improved with focused practice and study, especially with the abundance of tools that technology offers.
This method yields many words that are common in describing data science - like the word 'data' - but that aren't skills per se.
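One way to cut that noise is a hand-rolled stop-list of the frequent but uninformative tokens visible in the top-25 table above. This is a rough sketch; the list below is illustrative, not exhaustive.
noise <- c("data", "function", "s", "var", "true", "false", "document", "new", "px", "e", "window", "of the", "a data")
tdmClean <- subset(tdmSort, !(Category %in% noise)) # Drop the uninformative tokens
kable(head(tdmClean))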
The TDM approach is valuable because it will highlight terms that weren’t explicitly searched. If one had no prior information on what skills were useful, the TDM approach would be a valuable tool. In a sense, TDM lets the data ‘tell’ us what is most important. But it still requires human oversight; the results of TDM must be parsed manually, to winnow through the frequent but irrelevant terms. Similar terms like statistics and statistical may be worth combining to determine a more accurate result.
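Combining 'statistics' and 'statistical', for example, would mirror the 'machinelearning' fix we applied to the Twitter data. The sketch below writes to a new object rather than altering tdmSort.
tdmMerged <- tdmSort
statRow <- which(tdmMerged$Category == "statistics")
statAdjRow <- which(tdmMerged$Category == "statistical")
tdmMerged$x[statRow] <- tdmMerged$x[statRow] + tdmMerged$x[statAdjRow] # Pool the two counts (329 + 199)
tdmMerged <- tdmMerged[-statAdjRow, ] # Drop the now-redundant row
tdmMerged <- tdmMerged[order(-tdmMerged$x), ] # Re-sort by the pooled frequency
tdmMerged$rank <- seq_len(nrow(tdmMerged)) # Recompute the ranks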
This refining process can easily yield differing results. For instance, among our top 200 terms, which ones to pick out as skills is somewhat arbitrary: python is clearly a skill, but words like analysis, training, insights, experience, research, search, and team are more ambiguous - though even those are suggestive of what makes a successful data scientist.
And, as TDM's large accompanying dataset suggests, it is more computationally intensive. With a larger corpus of work to search, the TDM approach would have to be adapted to remain feasible.
Of course, the TDM method employed here is a simple one. It could be paired with, say, a machine learning algorithm that learns to distinguish common but unhelpful words like "data" from helpful ones like "statistics."
And in fact, such search methods have become commonplace in the legal industry, in a process known as discovery; in a given lawsuit, thousands of pages and emails might need to be searched to confirm or refute the allegations at hand. A burdensome, costly task for humans, but one well suited to a tireless algorithm.
In our work, TDM serves as a valuable complement to the other methods, even if it did not make any significantly different findings. And, these results are more akin to the words one would use to describe data science as a profession to laymen. Whereas the other results tended to focus on data-science specific terms, the TDM results highlighted general as well as specific skills, like management.
—————————————————————————————————————————————————————————————
Three different approaches were applied to data collection and analysis. The three techniques yield comparable results, with the top five skills being Big Data, Statistics, R, Machine Learning, and Python. Big data was emphasized in the press and TDM results, while machine learning ranked on top in the Twitter collection. These results should not be read as minimizing the importance of some data science skills relative to others; the importance of a given skill likely varies with the requirements of the project. A skill matters only insofar as the job at hand requires it.
There is a dizzying array of buzzwords and skillsets linked to data science; it can be hard to know which ones to pursue, and which to dismiss. Our findings suggest that a few core disciplines underlie the field, and great facility with them is essential. These fields are, broadly, mathematics and computer science. Certain sub-fields within them, like statistics and machine learning, may deserve special study.
These subjects should be mastered in concert with the tools of the trade - these currently include R, Python, SQL and other tools. A mathematician without a command of a modern computer language will find it difficult to do data science work. And one should be careful to note that these tools are apt to change, that they are simply the latest incarnation of the methods of data science.
Lastly, one must have the personal qualities to make full use of these tools and subjects. You must be able to write clearly, and to express your findings in intuitive visual aids; you must have a penchant for learning new things in unfamiliar domains while keeping a close eye on your work.
And from a business perspective, data science is essential to turning data into insight. But, it can be hard to find experts with all the necessary talents - they may need to be groomed from within; if one is found, they must be kept engaged, and have the autonomy to design their own solutions, to turn data into prosperity, and transform the fate of a firm.
O'Reilly Media published a salary survey that sought to determine which skills are in demand for data scientists. Commendably, the study uses regression analysis to quantify the importance of various attributes, from geography to education to prowess with specific tools. It also offers an interesting comparison of salaries based on the sorts of tasks data scientists do, from extract, transform and load (ETL) work to meetings and exploratory data analysis.
The O’Reilly survey gives insight into the different career trajectories available to data scientists, as well as recent changes in the popularity of various tools of the trade. While our results seem more pointed at the fundamentals of the field, the repeated annual surveys of O’Reilly Media allow them to uncover temporal trends in the data science field.