—————————————————————————————————————————————————————————————
We scraped Twitter for the number of times various data science related skills were mentioned, in four scrapes across three different dates: March 16, 19 and 20, 2016.
We have also scraped the contents of a set of published articles for the number of mentions of these same skills, which we'll discuss in part two, followed by a discussion of the term-document matrix (TDM) in part three. The data aggregation is done separately for each part, but the analysis of the three parts will be done jointly.
library(plyr)
library(RColorBrewer)
library(ggplot2)
library(plotly) # masks last_plot() from ggplot2 and layout() from graphics
library(wordcloud) # used below for the word cloud of article mentions
library(knitr) # kable(), used below for printing tables
We’ll load in the Twitter frequency data from our online repository; since the first column is merely line numbers, we can excise that.
twitter.url <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/frequency_results.csv") # Remote CSV of Twitter mention counts
twitter <- read.csv(twitter.url, stringsAsFactors = FALSE, sep = ",") # Read the counts into R
twitter <- twitter[, 2:4] # Drop the first column (row numbers), keeping skill_id, t_freq and dates
View(twitter)
kable(head(twitter))
| skill_id | t_freq | dates |
|---|---|---|
| 1 | 0 | 2016-03-16 |
| 2 | 0 | 2016-03-16 |
| 3 | 0 | 2016-03-16 |
| 4 | 0 | 2016-03-16 |
| 5 | 5 | 2016-03-16 |
| 6 | 0 | 2016-03-16 |
kable(tail(twitter))
|   | skill_id | t_freq | dates |
|---|---|---|---|
| 591 | 144 | 0 | 2016-03-20 |
| 592 | 145 | 0 | 2016-03-20 |
| 593 | 146 | 0 | 2016-03-20 |
| 594 | 147 | 0 | 2016-03-20 |
| 595 | 148 | 0 | 2016-03-20 |
| 596 | 149 | 0 | 2016-03-20 |
The skills are simply numbered from 1 onwards - we see that there are 149 skills.
Looking at the tables above, we see that we have multiple dates' worth of data. Since these dates are only a few days apart, it's not worth analyzing any temporal trend - e.g., how certain skills have become more or less popular over time. If we had gathered data years apart, that would have been a more fruitful exercise.
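Still, if we did want a quick per-date view, a simple cross-tabulation of the table we just loaded gives one. This is a minimal sketch, not part of the main analysis; xtabs() sums any duplicate date rows for a skill.
twitByDate <- xtabs(t_freq ~ skill_id + dates, data = twitter) # mentions per skill and date
head(twitByDate)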
Let's combine that data into one table of 149 rows, along with the respective skill titles. Our skill titles are stored in a separate table.
First, we subset the data table into its respective portions, and load the skill titles table.
twitPart1 <- subset(twitter, dates == "2016-03-16")
twitPart2 <- subset(twitter, dates == "2016-03-19")[1:149, ] # March 19 was scraped twice...
twitPart3 <- subset(twitter, dates == "2016-03-19")[150:298, ] # ...so its 298 rows are split into the two scrapes
twitPart4 <- subset(twitter, dates == "2016-03-20")
skillTitle.url <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/skillsAsher.csv") # The URL where the file listing the skill titles is located.
skillTitle <- read.csv(skillTitle.url, stringsAsFactors = FALSE) # Reads the skill titles into R.
twitAllDates <- data.frame(skill_id = twitPart1$skill_id,
                           t_freq = twitPart1$t_freq + twitPart2$t_freq + twitPart3$t_freq + twitPart4$t_freq, # Sum the mentions across all four scrapes
                           skill_title = skillTitle$skill_name,
                           stringsAsFactors = FALSE)
Looking at the data, we see that a number of skills don’t get mentioned at all. How many?
zeroTwits <- subset(twitAllDates, twitAllDates$t_freq == 0) # The subset of skills with zero mentions
nrow(twitAllDates) #The total number of skills
## [1] 149
nrow(zeroTwits) #The number of skills with zero mentions
## [1] 84
nrow(zeroTwits)/nrow(twitAllDates) #The proportion of skills with zero mentions
## [1] 0.5637584
About 56% of the skills we searched for were never mentioned. To reduce the chance of leaving out an important skill, we deliberately cast a wide net and included plenty of skills that are not commonly talked about, so it is no cause for concern that so many of them garnered no mentions. You could say our search was sensitive, but not specific :).
However, it's important to note that where there is a significant cost associated with gathering data, one must be more judicious about selecting what data to gather - you can't simply dream up a Christmas wish list of variables and ask for it all.
From here, we’ll limit our investigation to the skills with positive frequencies, hereafter twitPositive.
twitPositive <- subset(twitAllDates, twitAllDates$t_freq > 0)
twitSort <- twitPositive[order(-twitPositive$t_freq), ] #Sort results by frequency, descending
View(twitSort)
We'll have to do some minor cleaning - we have both 'machinelearning' and 'machine learning' in our dataset, and they both rank very high. We'll combine them into one entry named 'ML', since the full title is long and would make plotting in base R more troublesome.
Also, one of our top results has a title that is too long, ‘predictive analytics’ - that will be shortened to ‘pred. analysis.’
MLRowNum <- which(twitSort$skill_title == "machinelearning") #The row number of machinelearning
M_LRowNum <- which(twitSort$skill_title == "machine learning")#The row number of machine learning (i.e. with a space in between)
twitSort$skill_title[MLRowNum] <- "ML" #Renames machinelearning to ML
twitSort$t_freq[MLRowNum] <- twitSort$t_freq[MLRowNum] + twitSort$t_freq[M_LRowNum] #Sums the two ML frequencies together
twitSort <- twitSort[-M_LRowNum, ] #Deletes the duplicate row
#which(twitSort$skill_title == "machine learning")
PARowNum <- which(twitSort$skill_title == "predictive analytics")
twitSort$skill_title[PARowNum] <- "pred. analysis"
View(twitSort)
Before we begin to visualize and understand the data, we’ll do all the previous steps again for the mentions we gathered from published articles. Then we’ll compare the data from each source side by side.
We’ll replicate the process we did with Twitter for our dataset of mentions in the press. Our sample is 91 published articles. We checked each article against a similar list of skills, to count the number of mentions.
Our datasets were formatted slightly differently because different teams were involved and no standards were hashed out beforehand. That isn't a big deal here, since the data is simple and this project is a one-off; but if this were a routine activity, we'd want the datasets formatted identically, so we'd spend some time standardizing the output format before gathering any data.
articleURL <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/Build-URL_DataFrame-Output.csv")
articleData <- read.csv(articleURL, stringsAsFactors = FALSE, sep = ",")
View(head(articleData))
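As an aside, the formatting mismatch noted above is easy to see by comparing the column names of the two tables. (A trivial check; the comments below mention only the columns this analysis relies on - the article table may contain others.)
names(twitter) # skill_id, t_freq, dates
names(articleData) # includes skill_name and ds_freq, which we use next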
We'll want to aggregate the per-article mention counts into a grand total for each skill across all articles. We'll use the aggregate function.
articleAgg <- aggregate(articleData$ds_freq, by=list(Category= articleData$skill_name), FUN=sum)
names(articleAgg) <- c("skill", "frequency")
View(articleAgg)
With a simple function, we’ve consolidated our ~14,000 lines of data into 149. We can winnow this data down further, by removing the skills that did not garner a single mention.
articlePositive <- subset(articleAgg, articleAgg$frequency > 0)
nrow(articlePositive)
## [1] 115
View(articlePositive)
Now we have 115 skills with at least one mention in the articles we studied. 34 skills out of 149, or 23%, were not mentioned once.
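For comparison with the Twitter data, we can reproduce the zero-mention check here - a small sketch using the objects defined above.
nrow(articleAgg) - nrow(articlePositive) # Skills with zero article mentions: 34
(nrow(articleAgg) - nrow(articlePositive)) / nrow(articleAgg) # The proportion never mentioned, about 0.23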
Now, we can start looking at the most frequently mentioned skills. We’ll sort the skills according to their number of mentions, in descending order:
articleSort <- articlePositive[order(-articlePositive$frequency), ] #Sort results by frequency, descending
kable(head(articleSort))
|   | skill | frequency |
|---|---|---|
| 14 | big data | 704 |
| 132 | Statistics | 359 |
| 111 | R | 323 |
| 75 | Machine Learning | 297 |
| 54 | Hadoop | 272 |
| 109 | programming | 246 |
The output of the term-document matrix (TDM) was posted on GitHub, from which we will retrieve it. It holds the counts of each word in each article; we will aggregate these counts into a sum for each word across all articles, and sort them in descending order of frequency.
tdmURL <- "https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/term-document-matrix/tdm-df"
tdmData <- read.csv(url(tdmURL), stringsAsFactors = FALSE, sep = ",")
names(tdmData) <- c("article", "term", "freq")
tdmAgg <- aggregate(tdmData$freq, by=list(Category= tdmData$term), FUN=sum)
tdmSort <- tdmAgg[order(-tdmAgg$x), ] #Sort results by frequency, descending
tdmSort$rank <- seq_len(nrow(tdmSort)) # Rank each term by overall frequency
We’ll bring back our Twitter data here, so that we can look at the results of all three datasets together.
We’ll consider all three sets of results in barplot form, and use the rainbow() function to distinguish one bar from another.
barplot(twitSort$t_freq, main = "Twitter Mentions of Data Science Skills", xlab = "# of Mentions", ylab = "skills", horiz = TRUE, col = rainbow(nrow(twitSort)))
barplot(articleSort$frequency, main = "Press Mentions of Data Science Skills", xlab = "# of Mentions", ylab = "skills", horiz =TRUE, col = rainbow(nrow(articleSort)))
barplot(tdmSort$x, main = "Most Used Words in the Press, via TDM", xlab = "# of Mentions", ylab = "skills", horiz =TRUE, col = rainbow(nrow(tdmSort)))
While this is useful for getting a sense of the distribution of skills, there are simply too many skills to put in a single graph. But these graphs do convey that some skills get many more mentions than others - even after we’ve removed the skills with zero mentions.
Let's take a look at the top 15 skills:
kable(twitSort[1:15, 3])
| skill_title |
|---|
| ML |
| c |
| big data |
| python |
| statistics |
| research |
| infographic |
| innovation |
| pandas |
| spark |
| hadoop |
| visualization |
| programming |
| regression |
| pred. analysis |
kable(articleSort[1:15, 1])
| skill |
|---|
| big data |
| Statistics |
| R |
| Machine Learning |
| Hadoop |
| programming |
| Python |
| Visualization |
| Data Mining |
| Research |
| SQL |
| Java |
| communication |
| C++ |
| C |
top15 <- articleSort[1:15, ] # The 15 most-mentioned skills in the articles
p <- qplot(skill, frequency, data = top15, color = skill) # Plot each skill's mention count
p + theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate the x-axis labels so the skill names stay legible
set.seed(1234) # Fix the random seed so the word cloud layout is reproducible
wordcloud(words = articleSort$skill, freq = articleSort$frequency, rot.per=0.45, colors=brewer.pal(8, "Dark2")) # Word cloud of article mentions, sized by frequency
Below follow the 25 most frequent terms in the articles searched.
kable(tdmSort[1:25, ])
|   | Category | x | rank |
|---|---|---|---|
| 245 | data | 5291 | 1 |
| 364 | function | 1900 | 2 |
| 776 | s | 1479 | 3 |
| 786 | science | 1374 | 4 |
| 1036 | var | 1319 | 5 |
| 257 | data science | 1107 | 6 |
| 834 | skills | 1069 | 7 |
| 1001 | true | 904 | 8 |
| 793 | scientist | 877 | 9 |
| 258 | data scientist | 791 | 10 |
| 610 | new | 776 | 11 |
| 332 | false | 720 | 12 |
| 288 | document | 668 | 13 |
| 167 | business | 657 | 14 |
| 796 | scientists | 650 | 15 |
| 157 | big | 648 | 16 |
| 725 | px | 618 | 17 |
| 259 | data scientists | 594 | 18 |
| 296 | e | 574 | 19 |
| 45 | a data | 567 | 20 |
| 1071 | window | 561 | 21 |
| 158 | big data | 554 | 22 |
| 636 | of the | 554 | 23 |
| 36 | var | 531 | 24 |
| 91 | analytics | 531 | 25 |
Below is a list of data science skills parsed from within the top 200 mentions of the TDM results:
kable(tdmSort[c(22, 25, 51, 55, 74, 76, 78, 88, 104, 105, 112, 128, 137, 139, 163), ])
|   | Category | x | rank |
|---|---|---|---|
| 158 | big data | 554 | 22 |
| 91 | analytics | 531 | 25 |
| 873 | statistics | 329 | 51 |
| 86 | analysis | 288 | 55 |
| 714 | programming | 233 | 74 |
| 403 | hadoop | 229 | 76 |
| 566 | management | 218 | 78 |
| 871 | statistical | 199 | 88 |
| 726 | python | 183 | 104 |
| 325 | experience | 181 | 105 |
| 588 | mining | 175 | 112 |
| 537 |  | 158 | 128 |
| 256 | data mining | 151 | 137 |
| 1046 | visualization | 150 | 139 |
| 198 | cloud | 136 | 163 |
In barplot form, the top 15:
tdm15 <- rev(c(22, 25, 51, 55, 74, 76, 78, 88, 104, 105, 112, 128, 137, 139, 163)) # Positions of the skill terms above, reversed so the most frequent plots at the top
op <- par(mar = c(4.2, 6.3, 3, 2) + 0.1) # Widen the left margin to make room for the term labels
barplot(tdmSort$x[tdm15], main = "The 15 Most Mentioned Data Science Terms, via TDM", xlab = "# of Mentions", names.arg = tdmSort$Category[tdm15], las = 2, col = rainbow(15), cex.names = 0.8, horiz = TRUE)
par(op) # Restore the previous graphical parameters
The raw results also picked up some JavaScript tokens (e.g. var, function, window), but those have been excluded from the selection above.
We see that big data is a clear winner, especially in the press results. But big data has come to be a byword, even a synonym, for data science. Indeed, one of the main differences between data science and conventional statistics is that data science handles 'big' datasets generated by software, whereas statistics tends to focus on smaller or simpler samples - or refers to the specific mathematical techniques used to analyze the data, rather than the earlier processes of gathering and tidying it.
Big data has also become a buzzword in its own right - one of dubious distinction, critics say. The work it describes spans the fields of applied math and computer science, so it can't be lumped into a single bigger field like statistics or computer science. Regardless, it's safe to conclude that to be a data scientist, one must be comfortable with big data.
Next, at #2 in the press results, we have statistics. Statistics is a core competency of data science - as critical as reflexes are to a racecar driver.
In the Twitter data, the results are fairly consistent from day to day, with machine learning coming out on top; the other skills are closer to one another, so their relative frequency fluctuates a bit.
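A quick way to check that consistency is to pull out the top skill for each scrape date. The sketch below reuses the raw twitter and skillTitle tables, and assumes (as we did when building twitAllDates) that the rows of skillTitle line up with skill_id 1 through 149; note it uses the raw titles, before the 'ML' merge above.
for (d in unique(twitter$dates)) {
  daily <- aggregate(t_freq ~ skill_id, data = subset(twitter, dates == d), FUN = sum) # combines the two March 19 scrapes
  top <- daily[which.max(daily$t_freq), ]
  cat(d, "-", skillTitle$skill_name[top$skill_id], "with", top$t_freq, "mentions\n")
}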
We could go on and classify each skill individually, but looking at these skills, patterns emerge; we can put each of them into one of three categories: areas of competency, tools, and character traits.
Areas of competency are the general subjects and activities that transcend any specific piece of software or field, and are important or essential to being a good data scientist. These are things like numeracy, literacy, and fluency in logic and visual aids. These skills are timeless, at least as long as the data scientists are human. Most of these are general, but some are more specific, like machine learning and regression.
It is possible, however, that some of these skills may wax and wane in importance as the tools for doing data science improve. Perhaps you manage to set up a data collection system where little to no data tidying is needed; or some advanced form of artificial intelligence makes creating compelling charts much easier. Certainly, the relative value of these competencies depends on the subject of your work.
The second category is tools: the specific pieces of software data scientists use to do their jobs - and this is where you'll see the most change over time, in what is and isn't fashionable and in demand. Foremost among these are R, Python, Hadoop, SQL and other relational databases. Change is especially swift in areas where software can be changed easily - where organizations aren't tied up using legacy hardware and software.
Still, some of these tools are staples of the data scientist, and will be for some time. R and Python are two obvious candidates. And even when such tools are replaced, facility with the old ones will help in understanding the new ones that replace them. If, say, some other database software came to the fore, experience with SQL would ease the transition to the new program.
The third category is character traits. Different occupations call for different traits: you hope that your surgeon is careful enough not to amputate the wrong leg, and that your favorite chef has good hygiene. Data scientists, too, have certain preferred qualities.
Calling these items ‘traits’ can make them sound hard to change - but they can all be improved with focused practice and study, especially with the abundance of tools that technology offers.
This method yields many words that are common in describing data science - like the word 'data' - but that aren't skills per se.
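One way to cut that noise is a hand-rolled stop-list of the frequent but uninformative tokens visible in the top-25 table above. This is a rough sketch; the list below is illustrative, not exhaustive.
noise <- c("data", "function", "s", "var", "true", "false", "document", "new", "px", "e", "window", "of the", "a data")
tdmClean <- subset(tdmSort, !(Category %in% noise)) # Drop the uninformative tokens
kable(head(tdmClean))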
The TDM approach is valuable because it will highlight terms that weren’t explicitly searched. If one had no prior information on what skills were useful, the TDM approach would be a valuable tool. In a sense, TDM lets the data ‘tell’ us what is most important. But it still requires human oversight; the results of TDM must be parsed manually, to winnow through the frequent but irrelevant terms. Similar terms like statistics and statistical may be worth combining to determine a more accurate result.
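Combining 'statistics' and 'statistical', for example, would mirror the 'machinelearning' fix we applied to the Twitter data. The sketch below writes to a new object rather than altering tdmSort.
tdmMerged <- tdmSort
statRow <- which(tdmMerged$Category == "statistics")
statAdjRow <- which(tdmMerged$Category == "statistical")
tdmMerged$x[statRow] <- tdmMerged$x[statRow] + tdmMerged$x[statAdjRow] # Pool the two counts (329 + 199)
tdmMerged <- tdmMerged[-statAdjRow, ] # Drop the now-redundant row
tdmMerged <- tdmMerged[order(-tdmMerged$x), ] # Re-sort by the pooled frequency
tdmMerged$rank <- seq_len(nrow(tdmMerged)) # Recompute the ranks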
This refining process can easily yield differing results. For instance, among our top 200 terms, which ones to pick out as skills is somewhat arbitrary: python is clearly a skill, but words like analysis, training, insights, experience, research, search, and team are more ambiguous - though even those are suggestive of what makes a successful data scientist.
And, as TDM's large accompanying dataset suggests, it is more computationally intensive. With a larger corpus of work to search, the TDM approach would have to be adapted to remain feasible.
Of course, the TDM method employed here is a simple one. It could be paired with, say, a machine learning algorithm that learns to distinguish common but unhelpful words like "data" from helpful ones like "statistics."
And in fact, such search methods have become commonplace in the legal industry, in a process known as discovery; in a given lawsuit, thousands of pages and emails might need to be searched to confirm or refute the allegations at hand. A burdensome, costly task for humans, but one well suited to a tireless algorithm.
In our work, TDM serves as a valuable complement to the other methods, even if it did not make any significantly different findings. And, these results are more akin to the words one would use to describe data science as a profession to laymen. Whereas the other results tended to focus on data-science specific terms, the TDM results highlighted general as well as specific skills, like management.
—————————————————————————————————————————————————————————————
Three different approaches were applied to data collection and analysis. The three techniques yield comparable results, with the top five skills being Big Data, Statistics, R, Machine Learning, and Python. Big data was emphasized in the press and TDM results, while machine learning ranked on top in the Twitter collection. These results should not be read as minimizing the importance of some data science skills relative to others; the importance of a given skill likely varies with the requirements of the project. A skill matters only insofar as the job at hand requires it.
There is a dizzying array of buzzwords and skillsets linked to data science; it can be hard to know which ones to pursue, and which to dismiss. Our findings suggest that a few core disciplines underlie the field, and great facility with them is essential. These fields are, broadly, mathematics and computer science. Certain sub-fields within them, like statistics and machine learning, may deserve special study.
These subjects should be mastered in concert with the tools of the trade - these currently include R, Python, SQL and other tools. A mathematician without a command of a modern computer language will find it difficult to do data science work. And one should be careful to note that these tools are apt to change, that they are simply the latest incarnation of the methods of data science.
Lastly, one must have the personal qualities to make full use of these tools and subjects. You must be able to write clearly, and to express your findings in intuitive visual aids; you must have a penchant for learning new things in unfamiliar domains while keeping a close eye on your work.
And from a business perspective, data science is essential to turning data into insight. But, it can be hard to find experts with all the necessary talents - they may need to be groomed from within; if one is found, they must be kept engaged, and have the autonomy to design their own solutions, to turn data into prosperity, and transform the fate of a firm.
O'Reilly Media published a salary survey that sought to determine which skills are in demand for data scientists. Commendably, the study uses regression analysis to quantify the importance of various attributes, from geography to education to prowess with specific tools. It also offers an interesting comparison of salaries based on the sorts of tasks data scientists do, from extract, transform and load (ETL) work to meetings and exploratory data analysis.
The O’Reilly survey gives insight into the different career trajectories available to data scientists, as well as recent changes in the popularity of various tools of the trade. While our results seem more pointed at the fundamentals of the field, the repeated annual surveys of O’Reilly Media allow them to uncover temporal trends in the data science field.