Exploratory Data Analysis

Twitter Data

Popularity Contest, Part 1: The Gift of Twitter Gab

We took to Twitter and scraped it for the number of times various data science-related skills were mentioned: four scrapes across three dates, March 16, 19, and 20 (March 19 was scraped twice).

We have also scraped the contents of 91 published articles for the number of mentions of these same skills, which we’ll discuss in part two, followed by discussion of the term-document matrix (TDM) in part three. The data aggregation is done separately for each part, but the analysis of the three parts will be done jointly.


Step 1: Load the Data

We’ll load in the Twitter frequency data from our online repository; since the first column is merely line numbers, we can excise that.

twitter.url <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/frequency_results.csv") # The URL where the Twitter frequency data is located.

twitter <- read.csv(twitter.url, stringsAsFactors = FALSE, sep = ",") # Reads the frequency data into R.
twitter <- twitter[, 2:4] # Drops the first column, which merely holds line numbers.

View(twitter)
kable(head(twitter))
skill_id   t_freq   dates
       1        0   2016-03-16
       2        0   2016-03-16
       3        0   2016-03-16
       4        0   2016-03-16
       5        5   2016-03-16
       6        0   2016-03-16
kable(tail(twitter))
      skill_id   t_freq   dates
591        144        0   2016-03-20
592        145        0   2016-03-20
593        146        0   2016-03-20
594        147        0   2016-03-20
595        148        0   2016-03-20
596        149        0   2016-03-20

The skills are simply numbered from 1 onward (the unlabeled first column in the tail output is the data frame’s row number) - we see that there are 149 skills.

Step 2: Aggregating & Tidying the Data

Looking at the tables above, we see we have multiple dates’ worth of data. Since the dates are only a few days apart, it’s not worth analyzing any temporal trend - e.g., how certain skills have become more or less popular over time. If we had gathered data years apart, that would have been a more fruitful exercise.
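
Before pooling the dates, it’s easy to glance at the per-date totals; a minimal sketch using the twitter data frame loaded above (note that March 19 carries two scrapes’ worth of rows, so its total covers both):

aggregate(t_freq ~ dates, data = twitter, FUN = sum) # Total mentions recorded per scrape date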

Let’s combine that data into one table of 149 rows, along with the respective skill titles. Our skill titles are stored in a separate table.

First, we subset the data table into its respective portions, and load the skill titles table.

twitPart1 <- subset(twitter, dates == "2016-03-16")
twitPart2 <- subset(twitter, dates == "2016-03-19")[1:149, ] # First scrape of March 19
twitPart3 <- subset(twitter, dates == "2016-03-19")[150:298, ] # Second scrape of March 19
twitPart4 <- subset(twitter, dates == "2016-03-20")

skillTitle.url <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/skillsAsher.csv") # The URL where the file listing the skill titles is located.
skillTitle <- read.csv(skillTitle.url, stringsAsFactors = FALSE) # Reads the skill titles into R.

twitAllDates <- data.frame(
  skill_id = twitPart1$skill_id,
  t_freq = twitPart1$t_freq + twitPart2$t_freq + twitPart3$t_freq + twitPart4$t_freq, # Sums across all four scrapes
  skill_title = skillTitle$skill_name,
  stringsAsFactors = FALSE
)

Let’s Kick Out the Losers: Skills with Zero Mentions

Looking at the data, we see that a number of skills don’t get mentioned at all. How many?

zeroTwits <- subset(twitAllDates, twitAllDates$t_freq == 0) # The subset of frequencies that are zero
nrow(twitAllDates) # The total number of skills
## [1] 149
nrow(zeroTwits) # The number of skills with zero mentions
## [1] 84
nrow(zeroTwits)/nrow(twitAllDates) # The proportion of skills with zero mentions
## [1] 0.5637584

About 56% of the skills we searched for were never mentioned. To reduce the chance of leaving out an important skill, we clearly included lots of skills that were not commonly talked about. That so many of our skills garnered no mention is not a cause for concern here. You could say our search was sensitive, but not specific :).

However, it’s important to note that where there is a significant cost associated with gathering data, one must be more judicious about selecting what data to gather - you can’t just dream up a Christmas wishlist of variables and ask for it all.

Step 3: Sorting the Data

From here, we’ll limit our investigation to the skills with positive frequencies, hereafter twitPositive.

twitPositive <- subset(twitAllDates, twitAllDates$t_freq > 0)
twitSort <- twitPositive[order(-twitPositive$t_freq), ] #Sort results by frequency, descending
View(twitSort)

We’ll have to do some minor cleaning - we have both ‘machinelearning’ and ‘machine learning’ in our dataset, and both rank very high. We’ll combine them into one entry named ‘ML’, since the longer labels make plotting natively in R more troublesome.

Also, one of our top results has a title that is too long: ‘predictive analytics’ will be shortened to ‘pred. analysis’.

MLRowNum <- which(twitSort$skill_title == "machinelearning") # The row number of machinelearning
M_LRowNum <- which(twitSort$skill_title == "machine learning") # The row number of machine learning (i.e. with a space in between)
twitSort$skill_title[MLRowNum] <- "ML" # Renames machinelearning to ML
twitSort$t_freq[MLRowNum] <- twitSort$t_freq[MLRowNum] + twitSort$t_freq[M_LRowNum] # Sums the two ML frequencies together
twitSort <- twitSort[-M_LRowNum, ] # Deletes the duplicate row
# which(twitSort$skill_title == "machine learning") # Sanity check: should now return integer(0)


PARowNum <- which(twitSort$skill_title == "predictive analytics")
twitSort$skill_title[PARowNum] <- "pred. analysis"

View(twitSort)
In List Form, Twitter:
kable(twitSort[1:15, 3])
ML
c
big data
python
statistics
research
infographic
innovation
pandas
spark
hadoop
visualization
programming
regression
pred. analysis

Before we begin to visualize and understand the data, we’ll do all the previous steps again for the mentions we gathered from published articles. Then we’ll compare the data from each source side by side.


URL Data - The “Best Guess” Approach

Part II: Mentions of Data Science Skills in the Press

We’ll replicate the process we did with Twitter for our dataset of mentions in the press. Our sample is 91 published articles. We checked each article against a similar list of skills to count the number of mentions.

Our datasets were formatted slightly differently because different teams were involved and no standards were hashed out beforehand. That’s not a big deal here, since the data is simple and this project is a one-off; but if this were a routine activity, we’d want the datasets formatted identically, so we’d spend some time standardizing the output format before gathering the data.
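
As an illustration, here is a minimal, hypothetical helper for mapping one team’s column names onto a shared schema - the function and the example mapping below are sketches, not part of the project’s actual pipeline:

standardize_cols <- function(df, mapping) {
  # mapping is a named character vector: old_name = "new_name";
  # columns not in the mapping are left untouched.
  hit <- names(df) %in% names(mapping)
  names(df)[hit] <- mapping[names(df)[hit]]
  df
}
# e.g., align the article data's headers with a shared schema
# (the target names here are illustrative):
# articleData <- standardize_cols(articleData, c(skill_name = "skill", ds_freq = "frequency"))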

Step 1: Loading the Data

articleURL <- url("https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/data/Build-URL_DataFrame-Output.csv")
articleData <- read.csv(articleURL, stringsAsFactors = FALSE, sep = ",")
View(head(articleData))

Step 2: Aggregating & Winnowing the Mentions Across Articles

We’ll want to aggregate the per-article mention counts into a grand total for each skill across all articles. We’ll use the aggregate function.

articleAgg <- aggregate(articleData$ds_freq, by = list(Category = articleData$skill_name), FUN = sum) # Sums each skill's mentions across all articles
names(articleAgg) <- c("skill", "frequency")

View(articleAgg)

With a simple function, we’ve consolidated our ~14,000 lines of data into 149. We can winnow this data down further, by removing the skills that did not garner a single mention.

articlePositive <- subset(articleAgg, articleAgg$frequency > 0)
nrow(articlePositive)
## [1] 115
View(articlePositive)

Now we have 115 skills with at least one mention in the articles we studied. 34 skills out of 149, or 23%, were not mentioned once.
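
As a quick check, mirroring the zero-mention computation from the Twitter data:

nrow(articleAgg) - nrow(articlePositive) # The number of skills with zero mentions
## [1] 34
(nrow(articleAgg) - nrow(articlePositive)) / nrow(articleAgg) # The proportion with zero mentions
## [1] 0.2281879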

Step 3: Sorting the Data

Now, we can start looking at the most frequently mentioned skills. We’ll sort the skills according to their number of mentions, in descending order:

articleSort <- articlePositive[order(-articlePositive$frequency), ] #Sort results by frequency, descending
kable(head(articleSort))
      skill              frequency
14    big data                 704
132   Statistics               359
111   R                        323
75    Machine Learning         297
54    Hadoop                   272
109   programming              246
And the Press:
kable(articleSort[1:15, 1])
big data
Statistics
R
Machine Learning
Hadoop
programming
Python
Visualization
Data Mining
Research
SQL
Java
communication
C++
C

URL Data - The Term-Document Matrix Approach

Part III: Term Document Matrix - Most Frequently Used Words in the Press

The output of the term-document matrix was posted on GitHub, from which we will retrieve it. It holds the count of each word in each article; we will aggregate these into a total for each word across all articles, and sort the words in descending order of frequency.

tdmURL <- "https://raw.githubusercontent.com/RobertSellers/SlackProjects/master/term-document-matrix/tdm-df"

tdmData <- read.csv(url(tdmURL), stringsAsFactors = FALSE, sep = ",")
names(tdmData) <- c("article", "term", "freq")

tdmAgg <- aggregate(tdmData$freq, by = list(Category = tdmData$term), FUN = sum) # Totals each term's count across all articles
tdmSort <- tdmAgg[order(-tdmAgg$x), ] # Sort results by frequency, descending
tdmSort$rank <- seq_len(nrow(tdmSort)) # Rank 1 = most frequent term

TDM Results

Below follow the 200 most frequent terms in the articles searched (the columns are the data frame’s row name, the term, its total frequency, and its rank).

kable(tdmSort[1:200, ])
Category x rank
245 data 5291 1
364 function 1900 2
776 s 1479 3
786 science 1374 4
1036 var 1319 5
257 data science 1107 6
834 skills 1069 7
1001 true 904 8
793 scientist 877 9
258 data scientist 791 10
610 new 776 11
332 false 720 12
288 document 668 13
167 business 657 14
796 scientists 650 15
157 big 648 16
725 px 618 17
259 data scientists 594 18
296 e 574 19
45 a data 567 20
1071 window 561 21
158 big data 554 22
636 of the 554 23
36 var 531 24
91 analytics 531 25
176 can 530 26
997 top 490 27
736 r 470 28
437 id 463 29
455 in the 461 30
758 require 460 31
621 null 446 32
21 if 438 33
503 js 436 34
1008 type 436 35
24 new 431 36
1068 will 411 37
13 document 408 38
722 push 408 39
634 of data 407 40
1035 value 406 41
893 t 404 42
800 script 399 43
523 learning 390 44
429 http 384 45
1 – 365 46
766 return 359 47
861 src 346 48
17 function 339 49
501 jquery 335 50
873 statistics 329 51
497 job 311 52
601 need 303 53
650 one 296 54
86 analysis 288 55
446 important 288 56
929 the data 284 57
559 machine 281 58
37 window 279 59
431 https 275 60
200 com 273 61
995 tools 266 62
522 learn 265 63
533 like 263 64
970 to be 258 65
284 display 253 66
32 this 251 67
80 also 250 68
243 d 247 69
302 else 247 70
379 get 243 71
581 may 237 72
1023 use 236 73
714 programming 233 74
1021 url 233 75
403 hadoop 229 76
483 is a 223 77
566 management 218 78
1079 work 216 79
625 o 214 80
674 people 209 81
452 in data 205 82
986 to the 204 83
233 createelement 203 84
133 async 202 85
569 many 202 86
173 c 199 87
871 statistical 199 88
511 know 198 89
602 need to 195 90
528 length 194 91
342 find 192 92
425 how to 192 93
599 name 191 94
135 at 190 95
573 march 190 96
161 blog 186 97
846 software 186 98
7 at 185 99
391 good 184 100
413 height 184 101
344 first 183 102
450 in a 183 103
726 python 183 104
325 experience 181 105
563 make 181 106
831 skill 181 107
1087 www 180 108
466 information 179 109
556 m 179 110
1033 v 178 111
588 mining 175 112
162 body 173 113
507 just 172 114
211 companies 171 115
351 for data 171 116
516 language 171 117
612 news 170 118
791 science skills 170 119
1022 us 169 120
12 data 167 121
304 email 167 122
965 time 167 123
615 none 164 124
23 jquery 163 125
806 see 163 126
1032 using 162 127
537 linkedin 158 128
216 computer 157 129
648 on the 157 130
127 as a 156 131
1067 width 154 132
667 parentnode 153 133
35 true 152 134
494 january 152 135
151 become 151 136
256 data mining 151 137
548 log 151 138
1046 visualization 150 139
1076 with the 149 140
690 post 147 141
104 and the 146 142
193 class 146 143
337 field 146 144
29 s 145 145
547 location 145 146
744 read 145 147
125 article 144 148
195 click 144 149
88 analyst 143 150
412 head 143 151
241 customer 142 152
199 code 141 153
213 company 141 154
186 cdata 140 155
414 help 140 156
816 set 140 157
10 cdata 139 158
286 div 139 159
443 if you 139 160
499 jobs 139 161
281 different 137 162
198 cloud 136 163
206 comments 136 164
527 left 136 165
769 right 136 166
906 technology 136 167
488 is the 135 168
154 best 134 169
142 b 133 170
716 project 132 171
72 algorithms 131 172
630 of a 131 173
238 css 130 174
803 search 130 175
428 html 129 176
18 http 128 177
138 at the 128 178
472 insights 128 179
900 team 128 180
490 it is 127 181
960 think 127 182
1094 you can 127 183
14 else 125 184
938 the most 125 185
1083 world 125 186
28 return 124 187
469 insertbefore 124 188
538 list 124 189
620 now 124 190
762 research 122 191
967 title 122 192
144 background 121 193
745 ready 121 194
999 training 121 195
56 able 120 196
397 great 120 197
57 able to 119 198
229 course 119 199
279 development 119 200

Below is a list of the data science skills found within the top 200 terms of the TDM results; the number shown is each term’s rank. A programmatic sketch of this extraction follows the list.

22 Big Data

25 Analytics

51 statistics

55 analysis

74 programming

76 hadoop

78 management

88 statistical

104 python

105 experience

112 mining

128 linkedin

137 data mining

139 visualization

163 cloud

172 algorithms

179 insights

180 team

191 research

195 training

200 development
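
The list above was picked out by hand. Here is a minimal sketch of the same match done programmatically, assuming the skillTitle table from Part 1 and tdmSort are still loaded (a mechanical, case-insensitive match like this won’t reproduce the hand-curated list exactly, since judgment calls like ‘team’ or ‘experience’ aren’t captured):

top200 <- tdmSort[1:200, ]
isSkill <- tolower(top200$Category) %in% tolower(skillTitle$skill_name) # Flags terms matching a known skill title
top200[isSkill, c("rank", "Category", "x")] # x is the term's total frequency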

Data Visualization

Visualizing the Data

We’ll bring back our Twitter data here, so that we can look at the results of all three datasets together.

We’ll consider all three sets of results in barplot form, and use the rainbow() function to distinguish one bar from another.

barplot(twitSort$t_freq, main = "Twitter Mentions of Data Science Skills", xlab = "# of Mentions", ylab = "skills", horiz = TRUE, col = rainbow(nrow(twitSort)))

barplot(articleSort$frequency, main = "Press Mentions of Data Science Skills", xlab = "# of Mentions", ylab = "skills", horiz =TRUE, col = rainbow(nrow(articleSort)))

barplot(tdmSort$x, main = "Most Used Words in the Press, via TDM", xlab = "# of Mentions", ylab = "skills", horiz =TRUE, col = rainbow(nrow(tdmSort)))

While this is useful for getting a sense of the distribution of skills, there are simply too many skills to put in a single graph. But these graphs do convey that some skills get many more mentions than others - even after we’ve removed the skills with zero mentions.

Let’s take a look at the top 15 skills.


Top 15 Data Science Skills in the Press
top15 <- articleSort[1:15, ] # The 15 most-mentioned skills in the press

p <- qplot(skill, frequency, data = top15, color = skill)
p + theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotates the x-axis labels for readability

Data Science Skills in Wordcloud Form, via the Press

set.seed(1234) # For a reproducible word-cloud layout
wordcloud(words = articleSort$skill, freq = articleSort$frequency, rot.per = 0.45, colors = brewer.pal(8, "Dark2"))

In barplot form, the top 15:

tdm15 <- rev(c(22, 25, 51, 55, 74, 76, 78, 88, 104, 105, 112, 128, 137, 139, 163)) # Ranks of the 15 most frequent skills identified above, reversed so the most frequent bar plots at the top

op <- par(mar = c(4.2, 6.3, 3, 2) + 0.1) # Widens the left margin to fit the skill labels
barplot(tdmSort$x[tdm15], main = "The 15 Most Mentioned Data Science Terms, via TDM", xlab = "# of Mentions", names.arg = tdmSort$Category[tdm15], las = 2, col = rainbow(15), cex.names = 0.8, horiz = TRUE)

par(op) # Restores the previous plot margins

The TDM results picked up some javascript tags (e.g. var, function, px); those were excluded when selecting the skills above.
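
For a larger selection, one would want to filter these out programmatically. A minimal sketch, with an illustrative stoplist drawn from the tokens visible in the table above (not the exact list used):

jsTokens <- c("var", "function", "px", "true", "false", "document",
              "window", "jquery", "js", "null", "return", "script",
              "src", "push", "createelement", "insertbefore", "cdata") # Illustrative JavaScript/HTML tokens
tdmClean <- tdmSort[!(tdmSort$Category %in% jsTokens), ] # Keeps only non-JavaScript terms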

Comparing and Contrasting the Three Different Approaches

Analysis of Results

We see that big data is a clear winner, especially in the press results. But big data has come to be a byword, even a synonym, for data science; indeed, one of the main differences between data science and conventional statistics is that data science handles ‘big’ datasets generated by software, whereas statistics tends to focus on smaller or simpler samples - or refers to the specific mathematical techniques used to analyze the data, rather than the earlier processes of gathering and tidying it.

Big data has also become a buzzword in its own right - one of dubious distinction, critics say. The work it describes spans applied math and computer science, so it can’t be lumped into a single larger field like statistics or computer science. Regardless, it’s safe to conclude that to be a data scientist, one must be comfortable with big data.

Next, at #2 in the press results, we have statistics. Statistics is a core competency of data science - as critical as reflexes are to a racecar driver.

In the Twitter data, the results are fairly consistent from day to day, with machine learning coming out on top; the other skills are closer to one another, so their relative frequency fluctuates a bit.
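
Here is a sketch of checking that day-to-day consistency, assuming the per-date twitter data frame and the skillTitle table from Part 1 are still loaded (the ‘machinelearning’/‘machine learning’ duplicates are still separate entries at this stage):

byDate <- merge(twitter,
                data.frame(skill_id = seq_len(nrow(skillTitle)),
                           skill_title = skillTitle$skill_name,
                           stringsAsFactors = FALSE)) # Attaches skill titles to the per-date counts
do.call(rbind, lapply(split(byDate, byDate$dates), function(d) {
  d[which.max(d$t_freq), c("dates", "skill_title", "t_freq")] # The most-mentioned skill on each date
}))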

We could go on and classify every skill individually, but looking across them, patterns emerge; we can put each skill into one of three categories:

Skill Type 1: Subject Mastery

Areas of competency are the general subjects and activities that transcend any specific piece of software or field, and that are important or essential to being a good data scientist. These are things like numeracy, literacy, and fluency in logic and visual aids. These skills are timeless, at least as long as data scientists are human. Most are general, but some are more specific, like machine learning and regression.

It is possible, however, that some of these skills may wax and wane in importance as the tools for doing data science improve. Perhaps you manage to set up a data collection system where little to no data tidying is needed; or some advanced form of artificial intelligence makes creating compelling charts much easier. Certainly, the relative value of these competencies depends on the subject of your work.

Skill Type 2: Tools

These are the specific pieces of software data scientists use to do their jobs - and this is where you’ll see the most change over time in what is and isn’t fashionable and in demand. Foremost among these are R, Python, Hadoop, and SQL-based relational databases. Change is especially swift in areas where software can be changed easily - where organizations aren’t tied up in legacy hardware and software.

Still, some of these tools are staples of the data scientist, and will be for some time - R and Python are two obvious candidates. And even when such tools are replaced, facility with the old ones helps in understanding the new. If, say, some other database software came to the fore, experience with SQL would ease the transition to the new program.

Skill Type 3: Personal Traits

Different occupations call for different character traits. You hope that your surgeon is careful enough not to amputate the wrong leg, and that your favorite chef has good hygiene. Data scientists too have certain preferred qualities.

  • Communication: Data scientists should be able to express themselves clearly in non-technical terms, to fellow data scientists and especially to those who aren’t.
  • Visualization: The data scientist should excel at using visual aids to make his points.
  • Perlustration: A data scientist must carefully and constantly examine her data for quirks and mishaps; if such irregularities go unseen, they could throw off the whole analysis.
  • Curiosity: He must want to learn, both within his domain and without; this will help him find new, useful sources of data and complete the necessary research. A lack of curiosity and a compartmentalized perspective could lead to missed solutions and stagnation where innovation is required.

Calling these items ‘traits’ can make them sound hard to change - but they can all be improved with focused practice and study, especially with the abundance of tools that technology offers.

TDM Discussion

This method yields many words that are common in describing data science - like the word ‘data’ - but that aren’t skills per se.

The TDM approach is valuable because it highlights terms that weren’t explicitly searched for. If one had no prior information about which skills were useful, the TDM approach would be a valuable tool; in a sense, TDM lets the data ‘tell’ us what is most important. But it still requires human oversight: the results must be parsed manually to winnow out the frequent but irrelevant terms. Similar terms, like statistics and statistical, may be worth combining for a more accurate count.
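
For example, combining the two ‘statistics’ variants from the table above:

statRows <- tdmSort$Category %in% c("statistics", "statistical") # Rows for the two variants
sum(tdmSort$x[statRows]) # Combined frequency: 329 + 199 = 528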

This refining process can easily yield differing results. For instance, within our top 200 terms, which ones to pick out as skills is somewhat arbitrary: python is clearly a skill, but words like analysis, training, insights, experience, research, search, and team are more ambiguous - though even those are suggestive of what makes a successful data scientist.

And, as TDM’s large accompanying dataset suggests, it is more computationally intensive. With a larger corpus of work to search, the TDM approach would have to be adapted to remain tractable.
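
Here is a sketch of one way to keep the matrix tractable, assuming the raw article text were available in a character vector articleText (which is not part of this report’s data): the tm package can build the term-document matrix and prune very rare terms.

library(tm) # Text-mining package; provides corpus and TDM utilities
corpus <- VCorpus(VectorSource(articleText)) # One document per article
tdm <- TermDocumentMatrix(corpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE)) # Stored as a sparse matrix
tdmSmall <- removeSparseTerms(tdm, sparse = 0.99) # Drops terms absent from 99%+ of documents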

Of course, the TDM method employed here is a simple one. It could be paired with, say, a machine learning algorithm that ‘learns’ to distinguish common but unhelpful words like “data” from helpful ones like “statistics.”

And in fact, such search methods have become commonplace in the legal industry, in a process known as discovery; in a given lawsuit, thousands of pages and emails might need to be searched to confirm or refute the allegations at hand. A burdensome, costly task for humans, but one well suited to a tireless algorithm.

In our work, TDM serves as a valuable complement to the other methods, even if it did not yield significantly different findings. Its results are more akin to the words one would use to describe data science to laymen: whereas the other methods tended to surface data-science-specific terms, the TDM results highlighted general as well as specific skills, like management.

Conclusion

Three different approaches were applied for data collection and analysis. The techniques yield comparable results, with the top five skills centering on Big Data, Statistics, R, Machine Learning, and Python. Big data was emphasized in the press and TDM results, while machine learning ranked top in the Twitter collection. These results should not be read as minimizing the importance of some data science skills relative to others; the importance of a skill likely varies with the requirements of the project - a skill matters only insofar as the job parameters require it.

There is a dizzying array of buzzwords and skillsets linked to data science; it can be hard to know which ones to pursue, and which to dismiss. Our findings suggest that a few core disciplines underlie the field, and great facility with them is essential. These fields are, broadly, mathematics and computer science. Certain sub-fields within them, like statistics and machine learning, may deserve special study.

These subjects should be mastered in concert with the tools of the trade - these currently include R, Python, SQL and other tools. A mathematician without a command of a modern computer language will find it difficult to do data science work. And one should be careful to note that these tools are apt to change, that they are simply the latest incarnation of the methods of data science.

Lastly, one must have the personal qualities to make full use of these tools and subjects. You must be able to write clearly, and to express your findings in intuitive visual aids; you must have a penchant for learning new things in strange domains while keeping a close eye on your work.

And from a business perspective, data science is essential to turning data into insight. But it can be hard to find experts with all the necessary talents - they may need to be groomed from within; once found, they must be kept engaged and given the autonomy to design their own solutions, to turn data into prosperity and transform the fate of a firm.

Further Reading: A Salary Survey

O’Reilly Media published a salary survey that sought to determine what skills are in demand for data scientists. Commendably, the study uses regression analysis to quantify the importance of various attributes, from geography to education to prowess with specific tools. It has an interesting comparison of salaries based on what sorts of tasks data scientists do, from extract, transform and load (ETL) work to meetings and exploratory data analysis.

The O’Reilly survey gives insight into the different career trajectories available to data scientists, as well as recent changes in the popularity of various tools of the trade. While our results aim more at the fundamentals of the field, O’Reilly’s repeated annual surveys allow it to uncover temporal trends in data science.