—————————————————————————————————————————————————————————————

Statistical Analysis and Modeling

Part I: The Gift of Twitter Gab

We took to Twitter and scraped it for the number of times various data science related skills were mentioned - four scrapes across three different dates: March 16, 19 and 20 of 2016, with the 19th scraped twice.

We have also scraped the contents of a set of published articles for the number of mentions of these same skills, which we’ll discuss in Part II, followed by a discussion of the term-document matrix (TDM) in Part III. The data aggregation was completed separately, but the analysis of the three parts will be done jointly.

(Package startup messages omitted; this analysis loads plyr, RColorBrewer, ggplot2 and plotly, with plotly masking last_plot from ggplot2 and layout from graphics.)

Step 1: Load the Data

We’ll load in the Twitter frequency data from our online repository; since the first column is merely line numbers, we can excise that.

twitter.url <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/frequency_results.csv")

twitter <- read.csv(twitter.url, stringsAsFactors = FALSE, sep = ",")
twitter <- twitter[,2:4]

View(twitter)
kable(head(twitter))
skill_id   t_freq   dates
       1        0   2016-03-16
       2        0   2016-03-16
       3        0   2016-03-16
       4        0   2016-03-16
       5        5   2016-03-16
       6        0   2016-03-16
kable(tail(twitter))
      skill_id   t_freq   dates
591        144        0   2016-03-20
592        145        0   2016-03-20
593        146        0   2016-03-20
594        147        0   2016-03-20
595        148        0   2016-03-20
596        149        0   2016-03-20

The skills are simply numbered from 1 onwards - we see that there are 149 skills.
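
As a quick sanity check (not part of the original write-up), we can confirm that count and that the 596 rows correspond to four scrapes of 149 skills:

length(unique(twitter$skill_id)) # Number of distinct skills: 149
nrow(twitter)                    # Total rows: 596 = 149 skills x 4 scrapes
table(twitter$dates)             # Rows per date; March 19 shows twice as many because it was scraped twice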

Step 2: Aggregating & Tidying the Data

Looking at the above tables, we see that we have multiple dates’ worth of data. Since these dates are only a few days apart, it’s not worth analyzing any temporal trend - e.g., how certain skills have become more or less popular over time. If we had gathered data years apart, that would have been a more fruitful exercise.

Let’s combine that data into one table of 149 rows, along with the respective skill titles. The skill titles are stored in a separate table.

First, we subset the data table into its respective portions, and load the skill titles table.

twitPart1 <- subset(twitter, dates == "2016-03-16") # The March 16 scrape
twitPart2 <- subset(twitter, dates == "2016-03-19")[1:149, ]   # The first of the two March 19 scrapes
twitPart3 <- subset(twitter, dates == "2016-03-19")[150:298, ] # The second of the two March 19 scrapes
twitPart4 <- subset(twitter, dates == "2016-03-20") # The March 20 scrape

skillTitle.url <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/skillsAsher.csv") # The URL where the file listing the skill titles is located.
skillTitle <- read.csv(skillTitle.url, stringsAsFactors = FALSE) # Reads the skill titles into R.

twitAllDates <- data.frame(skill_id = twitPart1$skill_id,
                           t_freq = twitPart1$t_freq + twitPart2$t_freq + twitPart3$t_freq + twitPart4$t_freq,
                           skill_title = skillTitle$skill_name,
                           stringsAsFactors = FALSE) # Total mentions per skill across all four scrapes

Let’s Kick Out the Losers: Skills with Zero Mentions

Looking at the data, we see that a number of skills don’t get mentioned at all. How many?

zeroTwits <- subset(twitAllDates, twitAllDates$t_freq == 0) # The subset of frequencies that are zero
nrow(twitAllDates) #The total number of skills
## [1] 149
nrow(zeroTwits) #The number of skills with zero mentions
## [1] 84
nrow(zeroTwits)/nrow(twitAllDates) #The proportion of skills with zero mentions
## [1] 0.5637584

About 56% of the skills we searched for were never mentioned. To reduce the chance of leaving out an important skill, we clearly included lots of skills that were not commonly talked about. That so many of our skills garnered no mention is not a cause for concern here. You could say our search was sensitive, but not specific :).

However, it’s important to note that where there is a significant cost associated with gathering data, one must be more judicious about selecting what data to gather; there, you can’t just dream up a Christmas wishlist of variables and ask for it all.

Step 3: Sorting the Data

From here, we’ll limit our investigation to the skills with positive frequencies, hereafter twitPositive.

twitPositive <- subset(twitAllDates, twitAllDates$t_freq > 0)
twitSort <- twitPositive[order(-twitPositive$t_freq), ] #Sort results by frequency, descending
View(twitSort)

We’ll have to do some minor cleaning - we have both ‘machinelearning’ and ‘machine learning’ in our dataset, and both rank very high. We’ll combine them into one entry named ‘ML’, because the longer names make plotting natively in R more troublesome.

Also, one of our top results, ‘predictive analytics’, has a title that is too long - it will be shortened to ‘pred. analysis’.

MLRowNum <- which(twitSort$skill_title == "machinelearning") #The row number of machinelearning
M_LRowNum <- which(twitSort$skill_title == "machine learning")#The row number of machine learning (i.e. with a space in between)
twitSort$skill_title[MLRowNum] <- "ML" #Renames machinelearning to ML
twitSort$t_freq[MLRowNum] <- twitSort$t_freq[MLRowNum] + twitSort$t_freq[M_LRowNum] #Sums the two ML frequencies together
twitSort <- twitSort[-M_LRowNum, ] #Deletes the duplicate row
#which(twitSort$skill_title == "machine learning")


PARowNum <- which(twitSort$skill_title == "predictive analytics")
twitSort$skill_title[PARowNum] <- "pred. analysis"

View(twitSort)

Before we begin to visualize and understand the data, we’ll do all the previous steps again for the mentions we gathered from published articles. Then we’ll compare the data from each source side by side.


Part II: Mentions of Data Science Skills in the Press

We’ll replicate the process we used for the Twitter data on our dataset of mentions in the press. Our sample is 91 published articles; we checked each article against a similar list of skills to count the number of mentions.

Our datasets were formatted slightly differently, because different teams were involved and no standards were hashed out beforehand. That’s not a big deal here, because the data is simple and we are doing this project as a ‘one-off’; if this were a routine activity, we’d want to ensure the datasets were formatted identically, so we’d spend some time standardizing the output format before gathering the data.

Step 1: Loading the data

articleURL <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/Build-URL_DataFrame-Output.csv")
articleData <- read.csv(articleURL, stringsAsFactors = FALSE, sep = ",")
View(head(articleData))

Step 2: Aggregating & Winnowing the Mentions Across Articles

We’ll want to aggregate the per-article mention counts into a grand total for each skill across all articles. We’ll use the aggregate function.

articleAgg <- aggregate(articleData$ds_freq, by=list(Category= articleData$skill_name), FUN=sum)
names(articleAgg) <- c("skill", "frequency")

View(articleAgg)

With a simple function, we’ve consolidated our ~14,000 rows of data into 149. We can winnow this data down further by removing the skills that did not garner a single mention.

articlePositive <- subset(articleAgg, articleAgg$frequency > 0)
nrow(articlePositive)
## [1] 115
View(articlePositive)

Now we have 115 skills with at least one mention in the articles we studied. 34 skills out of 149, or 23%, were not mentioned once.
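
Mirroring the zero-mention check from Part I, a quick calculation (not in the original code) confirms those figures:

nrow(articleAgg) - nrow(articlePositive) # Skills never mentioned in the articles: 34
(nrow(articleAgg) - nrow(articlePositive)) / nrow(articleAgg) # Proportion with zero mentions: ~0.23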

Step 3: Sorting the Data

Now, we can start looking at the most frequently mentioned skills. We’ll sort the skills according to their number of mentions, in descending order:

articleSort <- articlePositive[order(-articlePositive$frequency), ] #Sort results by frequency, descending
kable(head(articleSort))
      skill              frequency
14    big data                 704
132   Statistics               359
111   R                        323
75    Machine Learning         297
54    Hadoop                   272
109   programming              246

Part III: Term Document Matrix - Most Frequently Used Words in the Press

The output of the term document matrix was posted on Github, from where we will retrieve it. This has the counts of each word in each article; we will aggregate these word totals into a sum for each word across all articles, and sort them in descending order of frequency.

tdmURL <- "https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/term-document-matrix/tdm-df"

tdmData <- read.csv(url(tdmURL), stringsAsFactors = FALSE, sep = ",")
names(tdmData) <- c("article", "term", "freq")

tdmAgg <- aggregate(tdmData$freq, by=list(Category= tdmData$term), FUN=sum)
tdmSort <- tdmAgg[order(-tdmAgg$x), ] #Sort results by frequency, descending
tdmSort$rank <- seq_len(nrow(tdmSort))

Step 4: Visualizing the Data

We’ll bring back our Twitter data here, so that we can look at the results of all three datasets together.

We’ll consider all three sets of results in barplot form, and use the rainbow() function to distinguish one bar from another.

barplot(twitSort$t_freq, main = "Twitter Mentions of Data Science Skills", xlab = "# of Mentions", ylab = "skills", horiz = TRUE, col = rainbow(nrow(twitSort)))

barplot(articleSort$frequency, main = "Press Mentions of Data Science Skills", xlab = "# of Mentions", ylab = "skills", horiz =TRUE, col = rainbow(nrow(articleSort)))

barplot(tdmSort$x, main = "Most Used Words in the Press, via TDM", xlab = "# of Mentions", ylab = "skills", horiz =TRUE, col = rainbow(nrow(tdmSort)))

While this is useful for getting a sense of the distribution of skills, there are simply too many skills to put in a single graph. But these graphs do convey that some skills get many more mentions than others - even after we’ve removed the skills with zero mentions.

Let’s take a look at the top 15 skills.

In List Form, the 15 Most Mentioned Skills On Twitter:
kable(twitSort[1:15, 3])
ML
c
big data
python
statistics
research
infographic
innovation
pandas
spark
hadoop
visualization
programming
regression
pred. analysis
The 15 Most Mentioned Skills in the Press:
kable(articleSort[1:15, 1])
big data
Statistics
R
Machine Learning
Hadoop
programming
Python
Visualization
Data Mining
Research
SQL
Java
communication
C++
C

Top 15 Data Science Skills in the Press
top15 <- articleSort[1:15, ]


p <- qplot(skill, frequency, data = top15, color = skill)
p + theme(axis.text.x = element_text(angle = 90, hjust = 1))

Data Science Skills in Wordcloud Form, via the Press

set.seed(1234)
wordcloud(words = articleSort$skill, freq = articleSort$frequency, rot.per=0.45, colors=brewer.pal(8, "Dark2"))

TDM Results

Below follow the 25 most frequent terms in the articles searched.

kable(tdmSort[1:25, ])
       Category          x      rank
245    data              5291   1
364    function          1900   2
776    s                 1479   3
786    science           1374   4
1036   var               1319   5
257    data science      1107   6
834    skills            1069   7
1001   true              904    8
793    scientist         877    9
258    data scientist    791    10
610    new               776    11
332    false             720    12
288    document          668    13
167    business          657    14
796    scientists        650    15
157    big               648    16
725    px                618    17
259    data scientists   594    18
296    e                 574    19
45     a data            567    20
1071   window            561    21
158    big data          554    22
636    of the            554    23
36     var               531    24
91     analytics         531    25

Below is a list of data science skills picked out from among the top 200 terms in the TDM results:

kable(tdmSort[c(22, 25, 51, 55, 74, 76, 78, 88, 104, 105, 112, 128, 137, 139, 163), ])
       Category          x     rank
158    big data          554   22
91     analytics         531   25
873    statistics        329   51
86     analysis          288   55
714    programming       233   74
403    hadoop            229   76
566    management        218   78
871    statistical       199   88
726    python            183   104
325    experience        181   105
588    mining            175   112
537    linkedin          158   128
256    data mining       151   137
1046   visualization     150   139
198    cloud             136   163

In barplot form, the top 15:

tdm15 <- rev(c(22, 25, 51, 55, 74, 76, 78, 88, 104, 105, 112, 128, 137, 139, 163))

op <- par(mar = c(4.2,6.3,3,2) + 0.1)
barplot(tdmSort$x[tdm15], main = "The 15 Most Mentioned Data Science Terms, via TDM", xlab = "# of Mentions", names.arg = tdmSort$Category[tdm15], las = 2, col = rainbow(15), cex.names = 0.8, horiz = TRUE)

par(op)

The raw TDM results picked up a number of JavaScript and HTML fragments (var, function, px, window, and the like), but those have been excluded from the skill list above.
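
For anyone reproducing this, a minimal sketch of that filtering follows; the stoplist is hand-picked from the table above for illustration and is not necessarily the exact list used.

jsResidue <- c("var", "function", "px", "window", "document", "true", "false", "s", "e") # JavaScript/HTML fragments and stray letters visible in the TDM output above
tdmClean <- subset(tdmSort, !(Category %in% jsResidue)) # Drop the residue before inspecting the top terms
kable(head(tdmClean))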

Analysis of Results

We see that big data is a clear winner, especially in the press results. But big data has come to be a byword, even a synonym, for data science; indeed, one of the main differences between data science and conventional statistics is that data science handles ‘big’ datasets generated by software, whereas statistics tends to focus on smaller or simpler samples - or otherwise refers to the specific mathematical techniques used to analyze the data, and not the earlier processes of gathering and tidying it.

Big data has also become a buzzword in its own right - one of dubious distinction, critics say. The work it describes spans the fields of applied math and computer science, so it can’t be lumped into a single larger field like statistics or computer science. Regardless, it’s safe to conclude that to be a data scientist, one must be comfortable with big data.

Next, at #2 in the press results, we have statistics. Statistics is a core competency of data science - as critical as reflexes are to a racecar driver.

In the Twitter data, the results are fairly consistent from day to day, with machine learning coming out on top; the other skills are closer to one another, so their relative frequency fluctuates a bit.
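
That day-to-day consistency can be checked directly from the raw twitter table loaded in Part I. The following is a rough sketch, not part of the original analysis; like the earlier merge, it assumes that skill_id indexes the rows of skillTitle in order.

perDate <- aggregate(t_freq ~ skill_id + dates, data = twitter, FUN = sum) # Mentions per skill per date; the two March 19 scrapes are summed together
topPerDate <- lapply(split(perDate, perDate$dates), function(d) {
  skillTitle$skill_name[d$skill_id[order(-d$t_freq)][1:5]] # The five most mentioned skills on that date
})
topPerDate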

We could go on and discuss each skill in turn, but looking across them, patterns emerge; we can put each skill into one of three categories:

Skill Type 1: Subject Mastery

Areas of competency are the general subjects and activities that transcend any specific piece of software or field, and are important or essential to being a good data scientist. These are things like numeracy, literacy, and fluency in logic and visual aids. These skills are timeless, at least as long as the data scientists are human. Most of these are general, but some are more specific, like machine learning and regression.

It is possible, however, that some of these skills may wax and wane in importance as the tools for doing data science improve. Perhaps you manage to set up a data collection system where little to no data tidying is needed; or some advanced form of artificial intelligence makes creating compelling charts much easier. Certainly, the relative value of these competencies depends on the subject of your work.

Skill Type 2: Tools

These are the specific pieces of software data scientists use to do their jobs - and this is where you’ll see the most change over time, in what is and isn’t fashionable and in demand. Foremost among these are R, Python, Hadoop, SQL and other relational database tools. Change is especially swift in areas where software can be changed easily - where organizations aren’t tied up using legacy hardware and software.

Still, some of these tools are staples of the data scientist, and will be for some time. R and Python are two obvious candidates. And even when such tools are replaced, facility with the old ones will help in understanding the new ones that replace them. If, say, some other database software came to the fore, experience with SQL would ease the transition to the new program.

Skill Type 3: Personal Traits

Different occupations call for different character traits. You hope that your surgeon is careful enough not to amputate the wrong leg, and that your favorite chef has good hygiene. Data scientists too have certain preferred qualities.

  • Communication: Data scientists should be able to express themselves clearly in non-technical terms, to other data scientists and especially to those who aren’t.
  • Visualization: The data scientist should excel at using visual aids to make his points.
  • Perlustration: A data scientist must carefully and constantly examine her data for quirks and mishaps; if such irregularities go unseen, it could throw off the whole analysis.
  • Curiosity: He must want to learn, both within his domain and without; this will help him find new, useful sources of data, and complete the research necessary. A lack of curiosity and a compartmentalized perspective could lead to missed solutions and stagnation, where innovation is required.

Calling these items ‘traits’ can make them sound hard to change - but they can all be improved with focused practice and study, especially with the abundance of tools that technology offers.

TDM Discussion

This method yields many words that may be common in describing data science - like the word ‘data’ - but that aren’t skills per se.

The TDM approach is valuable because it highlights terms that weren’t explicitly searched for. If one had no prior information on which skills were relevant, the TDM approach would be a valuable tool. In a sense, TDM lets the data ‘tell’ us what is most important. But it still requires human oversight; the results must be parsed manually to winnow out the frequent but irrelevant terms. Similar terms, like statistics and statistical, may be worth combining to produce a more accurate count, as sketched below.
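
As a sketch of that combining step (assuming tdmSort as built above), one could merge the ‘statistics’ and ‘statistical’ rows the same way the two machine-learning entries were merged in Part I:

statRow <- which(tdmSort$Category == "statistics")      # Row holding 'statistics'
statAdjRow <- which(tdmSort$Category == "statistical")  # Row holding 'statistical'
tdmSort$x[statRow] <- tdmSort$x[statRow] + tdmSort$x[statAdjRow] # Pool the two counts
tdmSort <- tdmSort[-statAdjRow, ] # Drop the now-redundant row
tdmSort <- tdmSort[order(-tdmSort$x), ] # Re-sort by frequency
tdmSort$rank <- seq_len(nrow(tdmSort)) # Re-assign ranks after the merge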

This refining process can easily yield differing results. For instance, within our top 200 terms, which ones to pick out as skills is somewhat arbitrary: python is clearly a skill, while words like analysis, training, insights, experience, research, search, and team are more ambiguous - though even those are suggestive of what makes a successful data scientist.

And, as TDM’s large accompanying dataset suggests, it is more computationally intensive. With a larger corpus to search, the TDM approach would have to be altered to remain tractable.

Of course, the TDM method employed here is a simple one. It could be paired with, say, a machine learning algorithm that ‘learns’ to distinguish common but unhelpful words like “data” from helpful ones like “statistics.”

And in fact, such search methods have become commonplace in the legal industry, in a process known as discovery; in a given lawsuit, thousands of pages and emails might need to be searched to confirm or refute the allegations at hand. A burdensome, costly task for humans, but one well suited to a tireless algorithm.

In our work, TDM serves as a valuable complement to the other methods, even if it did not surface significantly different findings. And these results are more akin to the words one would use to describe data science as a profession to laymen: whereas the other results tended to focus on data-science-specific terms, the TDM results highlighted general as well as specific skills, like management.

—————————————————————————————————————————————————————————————

Conclusion

Three different approaches were applied for data collection and analysis. The techniques yield comparable results, with the top five skills centering on Big Data, Statistics, R, Machine Learning, and Python. Big Data was emphasized in the press and TDM results, while Machine Learning ranked first in the Twitter collection. These results should not be taken to minimize the importance of some data science skills relative to others; the importance of a skill likely varies with the requirements of the project. A skill cannot be important unless the job’s parameters call for it.

There is a dizzying array of buzzwords and skillsets linked to data science; it can be hard to know which ones to pursue, and which to dismiss. Our findings suggest that a few core disciplines underlie the field, and great facility with them is essential. These fields are, broadly, mathematics and computer science. Certain sub-fields within them, like statistics and machine learning, may deserve special study.

These subjects should be mastered in concert with the tools of the trade - these currently include R, Python, SQL and other tools. A mathematician without a command of a modern computer language will find it difficult to do data science work. And one should be careful to note that these tools are apt to change, that they are simply the latest incarnation of the methods of data science.

Lastly, one must have the personal qualities to make full use of these tools and subjects. You must be able to write clearly, and to express your findings in intuitive visual aids; you must have a penchant for learning new things in strange domains while keeping a close eye on your work.

And from a business perspective, data science is essential to turning data into insight. But, it can be hard to find experts with all the necessary talents - they may need to be groomed from within; if one is found, they must be kept engaged, and have the autonomy to design their own solutions, to turn data into prosperity, and transform the fate of a firm.

Further Reading: A Salary Survey

O’Reilly Media published a salary survey that sought to determine what skills are in demand by data scientists. Commendably, the study uses regression analysis to quantify the importance of various attributes, from geography to education to prowess with specific tools. It has an interesting comparison of salaries based on what sorts of tasks data scientists do, from extract, transform and load (ETL) work to meetings and exploratory data analysis.

The O’Reilly survey gives insight into the different career trajectories available to data scientists, as well as recent changes in the popularity of various tools of the trade. While our results seem more pointed at the fundamentals of the field, the repeated annual surveys of O’Reilly Media allow them to uncover temporal trends in the data science field.