Originally posted to r-chart.com - but not in nifty R Markdown!
O'Reilly recently published the results of a survey of Strata Conference attendees covering tool usage and salary. The entire survey is available for download. In the survey results, R was heralded as second only to SQL among the tools used by conference attendees. A chart from the survey appeared in this post and elsewhere online.
These two technologies overlap a bit but are highly complementary. SQL can be used to quickly extract data from relational databases and to filter, order and summarize it. SQL queries can be executed from R itself, or run in another language to produce a CSV file that is then imported into R. R can handle additional filtering, ordering and summarizing, and is better suited to more sophisticated analysis, reshaping of data, and presentation in a final form.
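For example, a query can be run directly from R with the DBI package. This is a minimal sketch of that workflow; the database file, table, and column names are hypothetical, and the same result could just as easily come from a CSV produced by a query run elsewhere.
library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver, used here purely for illustration

con <- dbConnect(SQLite(), "survey.sqlite")  # hypothetical database file
tool.usage <- dbGetQuery(con, "
  SELECT tool, COUNT(*) AS respondents
  FROM responses
  GROUP BY tool
  ORDER BY respondents DESC")
dbDisconnect(con)

# Or import a CSV exported from a query run outside of R:
# tool.usage <- read.csv('tool_usage.csv')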
As part of an in-progress R screencast, I wanted to speculate a bit about the most common “clusters” of technologies that are popular among R users (at least among the Strata Conference respondents). Although the raw data from the survey is not available, the chart cited in the survey results includes enough information to do a bit of additional analysis. I reconstructed the original graph as a starting point, with the intention of splitting the data and non-data roles into faceted bar charts. This makes the usage reported by non-data respondents a bit clearer.
So the first step was to replicate the original plot with a few cosmetic and editorial updates (no “Respodents” appear in the new version). This involved the reshape2 and ggplot2 packages.
library(reshape2)  # provides melt() for reshaping data from wide to long form
library(ggplot2)   # plotting
With these available, I created the data frame by combining a few vectors containing the data of interest.
# Combine the tool names with the two sets of percentages. Note that rbind()
# coerces the numbers to text, so they are converted back to numeric later.
data.science.tools <- as.data.frame(t(rbind(
  c('All Respondents',
    'SQL','R','Python','Excel','Hadoop','Java',
    'Network/Graph','JavaScript','Tableau','D3',
    'Mahout','Ruby','SAS/SPSS'),
  c(57,42,33,26,25,23,17,16,7,15,8,7,5,9),
  c(43,29,10,15,11,12,17,4,13,4,5,6,6,2)
)))
names(data.science.tools) <- c('DataTool', 'Data', 'NonData')
At this point, the results match up with what appeared in the chart from the O'Reilly report. The numbers represent the percentage of respondents who use each tool.
data.science.tools
## DataTool Data NonData
## 1 All Respondents 57 43
## 2 SQL 42 29
## 3 R 33 10
## 4 Python 26 15
## 5 Excel 25 11
## 6 Hadoop 23 12
## 7 Java 17 17
## 8 Network/Graph 16 4
## 9 JavaScript 7 13
## 10 Tableau 15 4
## 11 D3 8 5
## 12 Mahout 7 6
## 13 Ruby 5 6
## 14 SAS/SPSS 9 2
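As an aside, the same data frame could be built more directly with data.frame(), which keeps the counts numeric from the start and makes the later as.numeric() conversion unnecessary. This is a sketch of an alternative, kept under a separate name so it does not interfere with the workflow used in the rest of this post.
data.science.tools.alt <- data.frame(
  DataTool = c('All Respondents', 'SQL', 'R', 'Python', 'Excel', 'Hadoop',
               'Java', 'Network/Graph', 'JavaScript', 'Tableau', 'D3',
               'Mahout', 'Ruby', 'SAS/SPSS'),
  Data     = c(57, 42, 33, 26, 25, 23, 17, 16, 7, 15, 8, 7, 5, 9),
  NonData  = c(43, 29, 10, 15, 11, 12, 17, 4, 13, 4, 5, 6, 6, 2),
  stringsAsFactors = FALSE  # keep the tool names as plain character strings
)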
The data is easier to work with once reshaped into long format using melt.
data.science.tools.df <- melt(
data.science.tools,
c('DataTool'),
variable.name='Role',
value.name='Respondents'
)
The resulting data frame:
data.science.tools.df
## DataTool Role Respondents
## 1 All Respondents Data 57
## 2 SQL Data 42
## 3 R Data 33
## 4 Python Data 26
## 5 Excel Data 25
## 6 Hadoop Data 23
## 7 Java Data 17
## 8 Network/Graph Data 16
## 9 JavaScript Data 7
## 10 Tableau Data 15
## 11 D3 Data 8
## 12 Mahout Data 7
## 13 Ruby Data 5
## 14 SAS/SPSS Data 9
## 15 All Respondents NonData 43
## 16 SQL NonData 29
## 17 R NonData 10
## 18 Python NonData 15
## 19 Excel NonData 11
## 20 Hadoop NonData 12
## 21 Java NonData 17
## 22 Network/Graph NonData 4
## 23 JavaScript NonData 13
## 24 Tableau NonData 4
## 25 D3 NonData 5
## 26 Mahout NonData 6
## 27 Ruby NonData 6
## 28 SAS/SPSS NonData 2
Convert the counts into the required numeric type (the rbind() above coerced them to text):
data.science.tools.df$Respondents <- as.numeric(
data.science.tools.df$Respondents
)
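A general R caution, not something the rest of this post depends on: if the melted column had come through as a factor rather than character (as can happen with older stringsAsFactors defaults), as.numeric() alone would return the underlying level codes rather than the displayed values. Converting through character first avoids that pitfall.
# Defensive version of the conversion above; harmless on an already-numeric
# or character column, and correct if the column happens to be a factor.
data.science.tools.df$Respondents <- as.numeric(
  as.character(data.science.tools.df$Respondents)
)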
Create a replica of the original chart:
ggplot(data = data.science.tools.df,
aes(x=reorder(DataTool,
Respondents,
function(x) max(x)
),
y=Respondents,
fill=Role)
) +
geom_bar(stat='identity') +
coord_flip() +
theme(axis.title.y = element_blank())
Now create the faceted version:
ggplot(data = data.science.tools.df,
aes(x=reorder(DataTool,
Respondents,
function(x) max(x)),
y=Respondents,
fill=Role)
) +
geom_bar(stat='identity') +
coord_flip() +
facet_grid(. ~ Role) +
theme(axis.title.y = element_blank())
Those in the non-data role appear to come largely from a more traditional software development/programming background. The top tool in use after SQL is Java, followed by Python and JavaScript. Hadoop is closely related as a Java-based framework. Excel is used more than R, which suggests a fascinating opportunity for R. Spreadsheets are and will remain useful, but anyone involved in data munging and analysis can benefit from R. As has been oft-trumpeted, scripted R programs are far more controlled and disciplined than clicking around in a spreadsheet, and they promote reproducible, less error-prone results. Ruby ranks a bit higher among the non-data users than among those in data roles, and SAS/SPSS usage is minimal, which also fits with a programmer audience.
To get a closer look at the “non-data” role:
ggplot(data = data.science.tools.df[data.science.tools.df$Role == "NonData", ],
       aes(x = reorder(DataTool, Respondents, function(x) max(x)),
           y = Respondents)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(axis.title.y = element_blank())
A number of tools are conspicuously absent from the survey. It would also be interesting to see related data about the respondents that undoubtedly impacts the results (mathematical proficiency, design abilities, typical data stores / database types accessed, typical audience for summarized data).
As I have been reviewing literature and educational resources on R, I am developing a stronger opinion that R, though a remarkably functional and powerful programming language, has not been presented well to a programming audience. Most introductions to R are more palatable to statisticians and others who have data analysis to complete but are not strongly aligned with programmer culture and expectations. The fact that so many R packages are in essence full-fledged DSLs has further complicated R's presentation. As I mentioned in my previous post, Hadley's new book and RStudio are significant inroads that highlight R in a more programmer-friendly way. The involvement of programmers at the Strata Conference and similar events will increase its visibility and accessibility as well.