Exploratory Data Analysis

Twitter Data

Popularity Contest, Part 1: The Gift of Twitter Gab

We took to Twitter and scraped it for the number of times various data science-related skills were mentioned: four scrapes across three dates, March 16, 19, and 20 (March 19 was scraped twice).

We have also scraped the contents of 91 published articles for the number of mentions of these same skills, which we’ll discuss in part two, followed by discussion of the term-document matrix (TDM) in part three. The data aggregation is done separately for each part, but the analysis of the three parts will be done jointly.


Step 1: Load the Data

We’ll load in the Twitter frequency data from our online repository; since the first column is merely line numbers, we can excise that.

twitter.url <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/frequency_results.csv") # The URL where the Twitter frequency data is located.

twitter <- read.csv(twitter.url, stringsAsFactors = FALSE, sep = ",") # Reads the frequency data into R.
twitter <- twitter[, 2:4] # Drops the first column, which merely holds line numbers.

View(twitter)
kable(head(twitter))
skill_id   t_freq   dates
       1        0   2016-03-16
       2        0   2016-03-16
       3        0   2016-03-16
       4        0   2016-03-16
       5        5   2016-03-16
       6        0   2016-03-16
kable(tail(twitter))
      skill_id   t_freq   dates
591        144        0   2016-03-20
592        145        0   2016-03-20
593        146        0   2016-03-20
594        147        0   2016-03-20
595        148        0   2016-03-20
596        149        0   2016-03-20

The skills are simply numbered from 1 onward (the unlabeled first column in the tail output is the data frame’s row number) - we see that there are 149 skills.

Step 2: Aggregating & Tidying the Data

Looking at the tables above, we see we have multiple dates’ worth of data. Since the dates are only a few days apart, it’s not worth analyzing any temporal trend - e.g., how certain skills have become more or less popular over time. If we had gathered data years apart, that would have been a more fruitful exercise.
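
Before pooling the dates, it’s easy to glance at the per-date totals; a minimal sketch using the twitter data frame loaded above (note that March 19 carries two scrapes’ worth of rows, so its total covers both):

aggregate(t_freq ~ dates, data = twitter, FUN = sum) # Total mentions recorded per scrape date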

Let’s combine that data into one table of 149 rows, along with the respective skill titles. Our skill titles are stored in a separate table.

First, we subset the data table into its respective portions, and load the skill titles table.

twitPart1 <- subset(twitter, dates == "2016-03-16")
twitPart2 <- subset(twitter, dates == "2016-03-19")[1:149, ] # First scrape of March 19
twitPart3 <- subset(twitter, dates == "2016-03-19")[150:298, ] # Second scrape of March 19
twitPart4 <- subset(twitter, dates == "2016-03-20")

skillTitle.url <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/skillsAsher.csv") # The URL where the file listing the skill titles is located.
skillTitle <- read.csv(skillTitle.url, stringsAsFactors = FALSE) # Reads the skill titles into R.

twitAllDates <- data.frame(
  skill_id = twitPart1$skill_id,
  t_freq = twitPart1$t_freq + twitPart2$t_freq + twitPart3$t_freq + twitPart4$t_freq, # Sums across all four scrapes
  skill_title = skillTitle$skill_name,
  stringsAsFactors = FALSE
)

Let’s Kick Out the Losers: Skills with Zero Mentions

Looking at the data, we see that a number of skills don’t get mentioned at all. How many?

zeroTwits <- subset(twitAllDates, twitAllDates$t_freq == 0) # The subset of frequencies that are zero
nrow(twitAllDates) # The total number of skills
## [1] 149
nrow(zeroTwits) # The number of skills with zero mentions
## [1] 84
nrow(zeroTwits)/nrow(twitAllDates) # The proportion of skills with zero mentions
## [1] 0.5637584

About 56% of the skills we searched for were never mentioned. To reduce the chance of leaving out an important skill, we clearly included lots of skills that were not commonly talked about. That so many of our skills garnered no mention is not a cause for concern here. You could say our search was sensitive, but not specific :).

However, it’s important to note that where there is a significant cost associated with gathering data, one must be more judicious about selecting what data to gather - you can’t just dream up a Christmas wishlist of variables and ask for it all.

Step 3: Sorting the Data

From here, we’ll limit our investigation to the skills with positive frequencies, hereafter twitPositive.

twitPositive <- subset(twitAllDates, twitAllDates$t_freq > 0)
twitSort <- twitPositive[order(-twitPositive$t_freq), ] #Sort results by frequency, descending
View(twitSort)

We’ll have to do some minor cleaning - we have both ‘machinelearning’ and ‘machine learning’ in our dataset, and both rank very high. We’ll combine them into one entry named ‘ML’, since the longer labels make plotting natively in R more troublesome.

Also, one of our top results has a title that is too long: ‘predictive analytics’ will be shortened to ‘pred. analysis’.

MLRowNum <- which(twitSort$skill_title == "machinelearning") # The row number of machinelearning
M_LRowNum <- which(twitSort$skill_title == "machine learning") # The row number of machine learning (i.e. with a space in between)
twitSort$skill_title[MLRowNum] <- "ML" # Renames machinelearning to ML
twitSort$t_freq[MLRowNum] <- twitSort$t_freq[MLRowNum] + twitSort$t_freq[M_LRowNum] # Sums the two ML frequencies together
twitSort <- twitSort[-M_LRowNum, ] # Deletes the duplicate row
# which(twitSort$skill_title == "machine learning") # Sanity check: should now return integer(0)


PARowNum <- which(twitSort$skill_title == "predictive analytics")
twitSort$skill_title[PARowNum] <- "pred. analysis"

View(twitSort)
In List Form, Twitter:
kable(twitSort[1:15, 3])
ML
c
big data
python
statistics
research
infographic
innovation
pandas
spark
hadoop
visualization
programming
regression
pred. analysis

Before we begin to visualize and understand the data, we’ll do all the previous steps again for the mentions we gathered from published articles. Then we’ll compare the data from each source side by side.


URL Data - The “Best Guess” Approach

Part II: Mentions of Data Science Skills in the Press

We’ll replicate the process we did with Twitter for our dataset of mentions in the press. Our sample is 91 published articles. We checked each article against a similar list of skills to count the number of mentions.

Our datasets were formatted slightly differently because different teams were involved and no standards were hashed out beforehand. That’s not a big deal here, since the data is simple and this project is a one-off; but if this were a routine activity, we’d want the datasets formatted identically, so we’d spend some time standardizing the output format before gathering the data.
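
As an illustration, here is a minimal, hypothetical helper for mapping one team’s column names onto a shared schema - the function and the example mapping below are sketches, not part of the project’s actual pipeline:

standardize_cols <- function(df, mapping) {
  # mapping is a named character vector: old_name = "new_name";
  # columns not in the mapping are left untouched.
  hit <- names(df) %in% names(mapping)
  names(df)[hit] <- mapping[names(df)[hit]]
  df
}
# e.g., align the article data's headers with a shared schema
# (the target names here are illustrative):
# articleData <- standardize_cols(articleData, c(skill_name = "skill", ds_freq = "frequency"))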

Step 1: Loading the Data

articleURL <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/Build-URL_DataFrame-Output.csv")
articleData <- read.csv(articleURL, stringsAsFactors = FALSE, sep = ",")
View(head(articleData))

Step 2: Aggregating & Winnowing the Mentions Across Articles

We’ll want to aggregate the per-article mention counts into a grand total for each skill across all articles. We’ll use the aggregate function.

articleAgg <- aggregate(articleData$ds_freq, by = list(Category = articleData$skill_name), FUN = sum) # Sums each skill's mentions across all articles
names(articleAgg) <- c("skill", "frequency")

View(articleAgg)

With a simple function, we’ve consolidated our ~14,000 lines of data into 149. We can winnow this data down further, by removing the skills that did not garner a single mention.

articlePositive <- subset(articleAgg, articleAgg$frequency > 0)
nrow(articlePositive)
## [1] 115
View(articlePositive)

Now we have 115 skills with at least one mention in the articles we studied. 34 skills out of 149, or 23%, were not mentioned once.
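
As a quick check, mirroring the zero-mention computation from the Twitter data:

nrow(articleAgg) - nrow(articlePositive) # The number of skills with zero mentions
## [1] 34
(nrow(articleAgg) - nrow(articlePositive)) / nrow(articleAgg) # The proportion with zero mentions
## [1] 0.2281879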

Step 3: Sorting the Data

Now, we can start looking at the most frequently mentioned skills. We’ll sort the skills according to their number of mentions, in descending order:

articleSort <- articlePositive[order(-articlePositive$frequency), ] #Sort results by frequency, descending
kable(head(articleSort))
      skill              frequency
14    big data                 704
132   Statistics               359
111   R                        323
75    Machine Learning         297
54    Hadoop                   272
109   programming              246
And the Press:
kable(articleSort[1:15, 1])
big data
Statistics
R
Machine Learning
Hadoop
programming
Python
Visualization
Data Mining
Research
SQL
Java
communication
C++
C

URL Data - The Term-Document Matrix Approach

Part III: Term Document Matrix - Most Frequently Used Words in the Press

The output of the term-document matrix was posted on GitHub, from which we will retrieve it. It holds the count of each word in each article; we will aggregate these into a total for each word across all articles, and sort the words in descending order of frequency.

tdmURL <- "https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/term-document-matrix/tdm-df"

tdmData <- read.csv(url(tdmURL), stringsAsFactors = FALSE, sep = ",")
names(tdmData) <- c("article", "term", "freq")

tdmAgg <- aggregate(tdmData$freq, by = list(Category = tdmData$term), FUN = sum) # Totals each term's count across all articles
tdmSort <- tdmAgg[order(-tdmAgg$x), ] # Sort results by frequency, descending
tdmSort$rank <- seq_len(nrow(tdmSort)) # Rank 1 = most frequent term

TDM Results

Below follow the 200 most frequent terms in the articles searched (the columns are the data frame’s row name, the term, its total frequency, and its rank).

kable(tdmSort[1:200, ])
Category x rank
245 data 5291 1
364 function 1900 2
776 s 1479 3
786 science 1374 4
1036 var 1319 5
257 data science 1107 6
834 skills 1069 7
1001 true 904 8
793 scientist 877 9
258 data scientist 791 10
610 new 776 11
332 false 720 12
288 document 668 13
167 business 657 14
796 scientists 650 15
157 big 648 16
725 px 618 17
259 data scientists 594 18
296 e 574 19
45 a data 567 20
1071 window 561 21
158 big data 554 22
636 of the 554 23
36 var 531 24
91 analytics 531 25
176 can 530 26
997 top 490 27
736 r 470 28
437 id 463 29
455 in the 461 30
758 require 460 31
621 null 446 32
21 if 438 33
503 js 436 34
1008 type 436 35
24 new 431 36
1068 will 411 37
13 document 408 38
722 push 408 39
634 of data 407 40
1035 value 406 41
893 t 404 42
800 script 399 43
523 learning 390 44
429 http 384 45
1 – 365 46
766 return 359 47
861 src 346 48
17 function 339 49
501 jquery 335 50
873 statistics 329 51
497 job 311 52
601 need 303 53
650 one 296 54
86 analysis 288 55
446 important 288 56
929 the data 284 57
559 machine 281 58
37 window 279 59
431 https 275 60
200 com 273 61
995 tools 266 62
522 learn 265 63
533 like 263 64
970 to be 258 65
284 display 253 66
32 this 251 67
80 also 250 68
243 d 247 69
302 else 247 70
379 get 243 71
581 may 237 72
1023 use 236 73
714 programming 233 74
1021 url 233 75
403 hadoop 229 76
483 is a 223 77
566 management 218 78
1079 work 216 79
625 o 214 80
674 people 209 81
452 in data 205 82
986 to the 204 83
233 createelement 203 84
133 async 202 85
569 many 202 86
173 c 199 87
871 statistical 199 88
511 know 198 89
602 need to 195 90
528 length 194 91
342 find 192 92
425 how to 192 93
599 name 191 94
135 at 190 95
573 march 190 96
161 blog 186 97
846 software 186 98
7 at 185 99
391 good 184 100
413 height 184 101
344 first 183 102
450 in a 183 103
726 python 183 104
325 experience 181 105
563 make 181 106
831 skill 181 107
1087 www 180 108
466 information 179 109
556 m 179 110
1033 v 178 111
588 mining 175 112
162 body 173 113
507 just 172 114
211 companies 171 115
351 for data 171 116
516 language 171 117
612 news 170 118
791 science skills 170 119
1022 us 169 120
12 data 167 121
304 email 167 122
965 time 167 123
615 none 164 124
23 jquery 163 125
806 see 163 126
1032 using 162 127
537 linkedin 158 128
216 computer 157 129
648 on the 157 130
127 as a 156 131
1067 width 154 132
667 parentnode 153 133
35 true 152 134
494 january 152 135
151 become 151 136
256 data mining 151 137
548 log 151 138
1046 visualization 150 139
1076 with the 149 140
690 post 147 141
104 and the 146 142
193 class 146 143
337 field 146 144
29 s 145 145
547 location 145 146
744 read 145 147
125 article 144 148
195 click 144 149
88 analyst 143 150
412 head 143 151
241 customer 142 152
199 code 141 153
213 company 141 154
186 cdata 140 155
414 help 140 156
816 set 140 157
10 cdata 139 158
286 div 139 159
443 if you 139 160
499 jobs 139 161
281 different 137 162
198 cloud 136 163
206 comments 136 164
527 left 136 165
769 right 136 166
906 technology 136 167
488 is the 135 168
154 best 134 169
142 b 133 170
716 project 132 171
72 algorithms 131 172
630 of a 131 173
238 css 130 174
803 search 130 175
428 html 129 176
18 http 128 177
138 at the 128 178
472 insights 128 179
900 team 128 180
490 it is 127 181
960 think 127 182
1094 you can 127 183
14 else 125 184
938 the most 125 185
1083 world 125 186
28 return 124 187
469 insertbefore 124 188
538 list 124 189
620 now 124 190
762 research 122 191
967 title 122 192
144 background 121 193
745 ready 121 194
999 training 121 195
56 able 120 196
397 great 120 197
57 able to 119 198
229 course 119 199
279 development 119 200

Below is a list of the data science skills found within the top 200 terms of the TDM results; the number shown is each term’s rank. A programmatic sketch of this extraction follows the list.

22 Big Data

25 Analytics

51 statistics

55 analysis

74 programming

76 hadoop

78 management

88 statistical

104 python

105 experience

112 mining

128 linkedin

137 data mining

139 visualization

163 cloud

172 algorithms

179 insights

180 team

191 research

195 training

200 development
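
The list above was picked out by hand. Here is a minimal sketch of the same match done programmatically, assuming the skillTitle table from Part 1 and tdmSort are still loaded (a mechanical, case-insensitive match like this won’t reproduce the hand-curated list exactly, since judgment calls like ‘team’ or ‘experience’ aren’t captured):

top200 <- tdmSort[1:200, ]
isSkill <- tolower(top200$Category) %in% tolower(skillTitle$skill_name) # Flags terms matching a known skill title
top200[isSkill, c("rank", "Category", "x")] # x is the term's total frequency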

Data Visualization

Visualizing the Data

We’ll bring back our Twitter data here, so that we can look at the results of all three datasets together.

We’ll consider all three sets of results in barplot form, and use the rainbow() function to distinguish one bar from another.

barplot(twitSort$t_freq, main = "Twitter Mentions of Data Science Skills", xlab = "# of Mentions", ylab = "skills", horiz = TRUE, col = rainbow(nrow(twitSort)))

barplot(articleSort$frequency, main = "Press Mentions of Data Science Skills", xlab = "# of Mentions", ylab = "skills", horiz =TRUE, col = rainbow(nrow(articleSort)))

barplot(tdmSort$x, main = "Most Used Words in the Press, via TDM", xlab = "# of Mentions", ylab = "skills", horiz =TRUE, col = rainbow(nrow(tdmSort)))

While this is useful for getting a sense of the distribution of skills, there are simply too many skills to put in a single graph. But these graphs do convey that some skills get many more mentions than others - even after we’ve removed the skills with zero mentions.

Let’s take a look at the top 15 skills.


Top 15 Data Science Skills in the Press
top15 <- articleSort[1:15, ] # The 15 most-mentioned skills in the press

p <- qplot(skill, frequency, data = top15, color = skill)
p + theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotates the x-axis labels for readability

Data Science Skills in Wordcloud Form, via the Press

set.seed(1234) # For a reproducible word-cloud layout
wordcloud(words = articleSort$skill, freq = articleSort$frequency, rot.per = 0.45, colors = brewer.pal(8, "Dark2"))

In barplot form, the top 15:

tdm15 <- rev(c(22, 25, 51, 55, 74, 76, 78, 88, 104, 105, 112, 128, 137, 139, 163)) # Ranks of the 15 most frequent skills identified above, reversed so the most frequent bar plots at the top

op <- par(mar = c(4.2, 6.3, 3, 2) + 0.1) # Widens the left margin to fit the skill labels
barplot(tdmSort$x[tdm15], main = "The 15 Most Mentioned Data Science Terms, via TDM", xlab = "# of Mentions", names.arg = tdmSort$Category[tdm15], las = 2, col = rainbow(15), cex.names = 0.8, horiz = TRUE)

par(op) # Restores the previous plot margins

The TDM results picked up some javascript tags (e.g. var, function, px); those were excluded when selecting the skills above.
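
For a larger selection, one would want to filter these out programmatically. A minimal sketch, with an illustrative stoplist drawn from the tokens visible in the table above (not the exact list used):

jsTokens <- c("var", "function", "px", "true", "false", "document",
              "window", "jquery", "js", "null", "return", "script",
              "src", "push", "createelement", "insertbefore", "cdata") # Illustrative JavaScript/HTML tokens
tdmClean <- tdmSort[!(tdmSort$Category %in% jsTokens), ] # Keeps only non-JavaScript terms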

Comparing and Contrasting the Three Different Approaches

Analysis of Results

We see that big data is a clear winner, especially in the press results. But big data has come to be a byword, even a synonym, for data science; indeed, one of the main differences between data science and conventional statistics is that data science handles ‘big’ datasets generated by software, whereas statistics tends to focus on smaller or simpler samples - or refers to the specific mathematical techniques used to analyze the data, rather than the earlier processes of gathering and tidying it.

Big data has also become a buzzword in its own right - one of dubious distinction, critics say. The work it describes spans applied math and computer science, so it can’t be lumped into a single larger field like statistics or computer science. Regardless, it’s safe to conclude that to be a data scientist, one must be comfortable with big data.

Next, at #2 in the press results, we have statistics. Statistics is a core competency of data science - as critical as reflexes are to a racecar driver.

In the Twitter data, the results are fairly consistent from day to day, with machine learning coming out on top; the other skills are closer to one another, so their relative frequency fluctuates a bit.
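
Here is a sketch of checking that day-to-day consistency, assuming the per-date twitter data frame and the skillTitle table from Part 1 are still loaded (the ‘machinelearning’/‘machine learning’ duplicates are still separate entries at this stage):

byDate <- merge(twitter,
                data.frame(skill_id = seq_len(nrow(skillTitle)),
                           skill_title = skillTitle$skill_name,
                           stringsAsFactors = FALSE)) # Attaches skill titles to the per-date counts
do.call(rbind, lapply(split(byDate, byDate$dates), function(d) {
  d[which.max(d$t_freq), c("dates", "skill_title", "t_freq")] # The most-mentioned skill on each date
}))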

We could go on and classify every skill individually, but looking across them, patterns emerge; we can put each skill into one of three categories:

Skill Type 1: Subject Mastery

Areas of competency are the general subjects and activities that transcend any specific piece of software or field, and that are important or essential to being a good data scientist. These are things like numeracy, literacy, and fluency in logic and visual aids. These skills are timeless, at least as long as data scientists are human. Most are general, but some are more specific, like machine learning and regression.

It is possible, however, that some of these skills may wax and wane in importance as the tools for doing data science improve. Perhaps you manage to set up a data collection system where little to no data tidying is needed; or some advanced form of artificial intelligence makes creating compelling charts much easier. Certainly, the relative value of these competencies depends on the subject of your work.

Skill Type 2: Tools

These are the specific pieces of software data scientists use to do their jobs - and this is where you’ll see the most change over time in what is and isn’t fashionable and in demand. Foremost among these are R, Python, Hadoop, and SQL-based relational databases. Change is especially swift in areas where software can be changed easily - where organizations aren’t tied up in legacy hardware and software.

Still, some of these tools are staples of the data scientist, and will be for some time - R and Python are two obvious candidates. And even when such tools are replaced, facility with the old ones helps in understanding the new. If, say, some other database software came to the fore, experience with SQL would ease the transition to the new program.

Skill Type 3: Personal Traits

Different occupations call for different character traits. You hope that your surgeon is careful enough not to amputate the wrong leg, and that your favorite chef has good hygiene. Data scientists too have certain preferred qualities.

  • Communication: Data scientists should be able to express themselves clearly in non-technical terms, to fellow data scientists and especially to those who aren’t.
  • Visualization: The data scientist should excel at using visual aids to make his points.
  • Perlustration: A data scientist must carefully and constantly examine her data for quirks and mishaps; if such irregularities go unseen, they could throw off the whole analysis.
  • Curiosity: He must want to learn, both within his domain and without; this will help him find new, useful sources of data and complete the necessary research. A lack of curiosity and a compartmentalized perspective could lead to missed solutions and stagnation where innovation is required.

Calling these items ‘traits’ can make them sound hard to change - but they can all be improved with focused practice and study, especially with the abundance of tools that technology offers.

TDM Discussion

This method yields many words that are common in describing data science - like the word ‘data’ - but that aren’t skills per se.

The TDM approach is valuable because it highlights terms that weren’t explicitly searched for. If one had no prior information about which skills were useful, the TDM approach would be a valuable tool; in a sense, TDM lets the data ‘tell’ us what is most important. But it still requires human oversight: the results must be parsed manually to winnow out the frequent but irrelevant terms. Similar terms, like statistics and statistical, may be worth combining for a more accurate count.
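
For example, combining the two ‘statistics’ variants from the table above:

statRows <- tdmSort$Category %in% c("statistics", "statistical") # Rows for the two variants
sum(tdmSort$x[statRows]) # Combined frequency: 329 + 199 = 528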

This refining process can easily yield differing results. For instance, within our top 200 terms, which ones to pick out as skills is somewhat arbitrary: python is clearly a skill, but words like analysis, training, insights, experience, research, search, and team are more ambiguous - though even those are suggestive of what makes a successful data scientist.

And, as TDM’s large accompanying dataset suggests, it is more computationally intensive. With a larger corpus of work to search, the TDM approach would have to be adapted to remain tractable.
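
Here is a sketch of one way to keep the matrix tractable, assuming the raw article text were available in a character vector articleText (which is not part of this report’s data): the tm package can build the term-document matrix and prune very rare terms.

library(tm) # Text-mining package; provides corpus and TDM utilities
corpus <- VCorpus(VectorSource(articleText)) # One document per article
tdm <- TermDocumentMatrix(corpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE)) # Stored as a sparse matrix
tdmSmall <- removeSparseTerms(tdm, sparse = 0.99) # Drops terms absent from 99%+ of documents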

Of course, the TDM method employed here is a simple one. It could be paired with, say, a machine learning algorithm that ‘learns’ to distinguish common but unhelpful words like “data” from helpful ones like “statistics.”

And in fact, such search methods have become commonplace in the legal industry, in a process known as discovery; in a given lawsuit, thousands of pages and emails might need to be searched to confirm or refute the allegations at hand. A burdensome, costly task for humans, but one well suited to a tireless algorithm.

In our work, TDM serves as a valuable complement to the other methods, even if it did not yield significantly different findings. Its results are more akin to the words one would use to describe data science to laymen: whereas the other methods tended to surface data-science-specific terms, the TDM results highlighted general as well as specific skills, like management.

Conclusion

Three different approaches were applied for data collection and analysis. The techniques yield comparable results, with the top five skills centering on Big Data, Statistics, R, Machine Learning, and Python. Big data was emphasized in the press and TDM results, while machine learning ranked top in the Twitter collection. These results should not be read as minimizing the importance of some data science skills relative to others; the importance of a skill likely varies with the requirements of the project - a skill matters only insofar as the job parameters require it.

There is a dizzying array of buzzwords and skillsets linked to data science; it can be hard to know which ones to pursue, and which to dismiss. Our findings suggest that a few core disciplines underlie the field, and great facility with them is essential. These fields are, broadly, mathematics and computer science. Certain sub-fields within them, like statistics and machine learning, may deserve special study.

These subjects should be mastered in concert with the tools of the trade - these currently include R, Python, SQL and other tools. A mathematician without a command of a modern computer language will find it difficult to do data science work. And one should be careful to note that these tools are apt to change, that they are simply the latest incarnation of the methods of data science.

Lastly, one must have the personal qualities to make full use of these tools and subjects. You must be able to write clearly, and to express your findings in intuitive visual aids; you must have a penchant for learning new things in strange domains while keeping a close eye on your work.

And from a business perspective, data science is essential to turning data into insight. But it can be hard to find experts with all the necessary talents - they may need to be groomed from within; once found, they must be kept engaged and given the autonomy to design their own solutions, to turn data into prosperity and transform the fate of a firm.

Further Reading: A Salary Survey

O’Reilly Media published a salary survey that sought to determine what skills are in demand for data scientists. Commendably, the study uses regression analysis to quantify the importance of various attributes, from geography to education to prowess with specific tools. It has an interesting comparison of salaries based on what sorts of tasks data scientists do, from extract, transform and load (ETL) work to meetings and exploratory data analysis.

The O’Reilly survey gives insight into the different career trajectories available to data scientists, as well as recent changes in the popularity of various tools of the trade. While our results aim more at the fundamentals of the field, O’Reilly’s repeated annual surveys allow it to uncover temporal trends in data science.