R and (Software) Relatives

Originally posted to r-chart.com - but not in nifty R Markdown!

O'Reilly recently published the results of a survey of Strata Conference attendees covering tool usage and salary. The entire survey is available for download. In the survey results, R was heralded as second only to SQL as a tool used by conference attendees. A chart from the survey appeared in this post and elsewhere online.

These two technologies overlap a bit but are highly complementary. SQL can quickly extract data from relational databases and filter, order, and summarize it. Queries can be executed from R itself, or run in another language to produce a CSV file that can be imported into R. R can then do additional filtering, ordering, and summarizing, and is better suited to more sophisticated analysis, reshaping of data, and presentation in a final form.
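To make the interplay concrete, here is a minimal sketch of querying a database from R using the DBI and RSQLite packages. The survey.sqlite file and the responses table are hypothetical stand-ins for whatever store actually holds the raw data.

library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver

# Connect to a hypothetical SQLite database of survey responses
con <- dbConnect(RSQLite::SQLite(), 'survey.sqlite')

# Let SQL do the filtering, ordering and summarizing...
tool.counts <- dbGetQuery(con, "
  SELECT tool, COUNT(*) AS respondents
  FROM responses
  GROUP BY tool
  ORDER BY respondents DESC
")

dbDisconnect(con)

# ...and hand the result to R for further analysis
head(tool.counts)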

As part of an in-progress R screencast, I wanted to speculate a bit about the most common “clusters” of technologies that are popular among R users (at least among the Strata Conference respondents). Although the raw data from the survey is not available, the chart cited in the survey results includes enough information to do a bit of additional analysis. I reconstructed the original graph as a starting point, with the intention of splitting out the data and non-data roles into faceted bar charts. This would make usage reported among non-data respondents a bit clearer.

So the first step was to replicate the original plot with a few cosmetic and editorial updates - the original's misspelled “Respodents” does not appear in the new version. This involved the use of reshape2 and ggplot2.

library(reshape2)
library(ggplot2)

With these available, I created the data frame by combining a few vectors containing the data of interest.

# Transpose a stack of vectors into a 14-row, 3-column data frame.
# Note that rbind() on these vectors produces a character matrix, so
# every column comes out as character (or factor); the counts are
# converted back to numeric further down.
data.science.tools <- as.data.frame(t(rbind(
  c('All Respondents',
    'SQL','R','Python','Excel','Hadoop','Java',
    'Network/Graph','JavaScript','Tableau','D3',
    'Mahout','Ruby','SAS/SPSS'),
  c(57,42,33,26,25,23,17,16,7,15,8,7,5,9),
  c(43,29,10,15,11,12,17,4,13,4,5,6,6,2)
)))

names(data.science.tools) <- c('DataTool', 'Data', 'NonData')

At this point, the results match up with what appeared in the chart from the O'Reilly report. The numbers represent the percentage of respondents who use the given tool.

data.science.tools
##           DataTool Data NonData
## 1  All Respondents   57      43
## 2              SQL   42      29
## 3                R   33      10
## 4           Python   26      15
## 5            Excel   25      11
## 6           Hadoop   23      12
## 7             Java   17      17
## 8    Network/Graph   16       4
## 9       JavaScript    7      13
## 10         Tableau   15       4
## 11              D3    8       5
## 12          Mahout    7       6
## 13            Ruby    5       6
## 14        SAS/SPSS    9       2
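As an aside, building the frame column by column with data.frame() would keep the counts numeric from the start - the t(rbind(...)) route above coerces everything to character, which is why a conversion step appears below:

# Equivalent construction that never coerces the counts to character
data.science.tools <- data.frame(
  DataTool = c('All Respondents', 'SQL', 'R', 'Python', 'Excel',
               'Hadoop', 'Java', 'Network/Graph', 'JavaScript',
               'Tableau', 'D3', 'Mahout', 'Ruby', 'SAS/SPSS'),
  Data     = c(57, 42, 33, 26, 25, 23, 17, 16, 7, 15, 8, 7, 5, 9),
  NonData  = c(43, 29, 10, 15, 11, 12, 17, 4, 13, 4, 5, 6, 6, 2)
)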

The data is easier to work with if reshaped into long format using melt.

data.science.tools.df <- melt(
  data.science.tools,
  id.vars = 'DataTool',
  variable.name = 'Role',
  value.name = 'Respondents'
)

The resulting data frame:

data.science.tools.df
##           DataTool    Role Respondents
## 1  All Respondents    Data          57
## 2              SQL    Data          42
## 3                R    Data          33
## 4           Python    Data          26
## 5            Excel    Data          25
## 6           Hadoop    Data          23
## 7             Java    Data          17
## 8    Network/Graph    Data          16
## 9       JavaScript    Data           7
## 10         Tableau    Data          15
## 11              D3    Data           8
## 12          Mahout    Data           7
## 13            Ruby    Data           5
## 14        SAS/SPSS    Data           9
## 15 All Respondents NonData          43
## 16             SQL NonData          29
## 17               R NonData          10
## 18          Python NonData          15
## 19           Excel NonData          11
## 20          Hadoop NonData          12
## 21            Java NonData          17
## 22   Network/Graph NonData           4
## 23      JavaScript NonData          13
## 24         Tableau NonData           4
## 25              D3 NonData           5
## 26          Mahout NonData           6
## 27            Ruby NonData           6
## 28        SAS/SPSS NonData           2

The Respondents column now needs to be converted to the required numeric type. Going through as.character first matters here: calling as.numeric() directly on a factor column would return the underlying level codes rather than the values themselves.

data.science.tools.df$Respondents <- as.numeric(
  as.character(data.science.tools.df$Respondents)
)
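A quick sanity check: the converted values should span the percentages in the table, from 2 (SAS/SPSS among non-data respondents) up to 57.

range(data.science.tools.df$Respondents)
## [1]  2 57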

Create the original chart:

ggplot(data = data.science.tools.df,
       aes(x = reorder(DataTool, Respondents, FUN = max),
           y = Respondents,
           fill = Role)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  theme(axis.title.y = element_blank())

(Plot: the reconstructed version of the original chart - stacked bars of tool usage by role)
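To keep a copy of the chart for a post like this one, ggsave() writes the most recently displayed plot to disk; the file name and dimensions here are arbitrary examples.

ggsave('tool-usage-stacked.png', width = 8, height = 6)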

Now for the faceted version:

ggplot(data = data.science.tools.df,
       aes(x = reorder(DataTool, Respondents, FUN = max),
           y = Respondents,
           fill = Role)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  facet_grid(. ~ Role) +
  theme(axis.title.y = element_blank())

(Plot: tool usage split into Data and NonData facets)
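For what it's worth, dodging the bars side by side within a single panel is an alternative to faceting - which reads better is largely a matter of taste:

ggplot(data = data.science.tools.df,
       aes(x = reorder(DataTool, Respondents, FUN = max),
           y = Respondents,
           fill = Role)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  coord_flip() +
  theme(axis.title.y = element_blank())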

Those in the non-data role appear to come largely from a more traditional software development/programming background. The top tool in use after SQL is Java, followed by Python and JavaScript. Hadoop, as a Java-based framework, is closely related. Excel is used more than R, which suggests a fascinating opportunity for R. Spreadsheets are and will remain useful, but anyone involved in data munging and analysis can benefit from R. As has been oft-trumpeted, scripted R programs are far more controlled and disciplined than clicking around in a spreadsheet, and they promote reproducible, less error-prone results. Ruby ranks a bit higher among the non-data users than among the data users, and SAS/SPSS usage is minimal, both of which also fit a programmer audience.

To get a closer look at the “NonData” role:

ggplot(data = data.science.tools.df[data.science.tools.df$Role == "NonData", ],
       aes(x = reorder(DataTool, Respondents, FUN = max),
           y = Respondents)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  theme(axis.title.y = element_blank())

(Plot: tool usage among NonData respondents only)
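Base R's subset() expresses the same filter a bit more readably; the non.data name is just illustrative, and the result can replace the bracket indexing in the call above.

# Same rows, selected with subset() instead of bracket indexing
non.data <- subset(data.science.tools.df, Role == 'NonData')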

A number of tools are conspicuously absent from the survey.

It would also be interesting to see related data about the respondents that undoubtedly impacts the results (mathematical proficiency, design abilities, typical data stores / database types accessed, and the typical audience for summarized data).

As I have been reviewing literature and educational resources on R, I am developing a stronger opinion that R, though a remarkable and powerful functional programming language, has not been presented well to a programming audience. Most introductions to R are more palatable to statisticians and others who have data analysis to complete but are not strongly aligned with programmer culture and expectations. The fact that so many R packages are in essence full-fledged DSLs further complicates R's presentation. As I mentioned in my previous post, Hadley's new book and RStudio are significant inroads that present R in a more programmer-friendly way. The involvement of programmers at the Strata Conference and similar events will increase its visibility and accessibility as well.