R and (Software) Relatives

Originally posted to r-chart.com - but not in nifty R Markdown!

O'Reilly recently published the results of a survey of Strata Conference attendees covering tool usage and salary. The entire survey is available for download. In the survey results, R was heralded as second only to SQL as a tool used by conference attendees. A chart from the survey appeared in this post and elsewhere online.

These two technologies overlap a bit but are highly complementary. SQL can quickly extract data from relational databases and filter, order, and summarize it. Queries can be executed from R itself, or run in another language to produce a CSV file that can be imported into R. R can then do additional filtering, ordering, and summarizing, and is better suited to more sophisticated analysis, reshaping of data, and presentation in a final form.
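To make the interplay concrete, here is a minimal sketch of querying a database from R using the DBI and RSQLite packages. The survey.sqlite file and the responses table are hypothetical stand-ins for whatever store actually holds the raw data.

library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver

# Connect to a hypothetical SQLite database of survey responses
con <- dbConnect(RSQLite::SQLite(), 'survey.sqlite')

# Let SQL do the filtering, ordering and summarizing...
tool.counts <- dbGetQuery(con, "
  SELECT tool, COUNT(*) AS respondents
  FROM responses
  GROUP BY tool
  ORDER BY respondents DESC
")

dbDisconnect(con)

# ...and hand the result to R for further analysis
head(tool.counts)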

As part of an in-progress R screencast, I wanted to speculate a bit about the most common “clusters” of technologies that are popular among R users (at least among the Strata Conference respondents). Although the raw data from the survey is not available, the chart cited in the survey results includes enough information to do a bit of additional analysis. I reconstructed the original graph as a starting point, with the intention of splitting out the data and non-data roles into faceted bar charts. This would make usage reported among non-data respondents a bit clearer.

So the first step was to replicate the original plot with a few cosmetic and editorial updates - the original's misspelled “Respodents” does not appear in the new version. This involved the use of reshape2 and ggplot2.

library(reshape2)
library(ggplot2)

With these available, I created the data frame by combining a few vectors containing the data of interest.

# Transpose a stack of vectors into a 14-row, 3-column data frame.
# Note that rbind() on these vectors produces a character matrix, so
# every column comes out as character (or factor); the counts are
# converted back to numeric further down.
data.science.tools <- as.data.frame(t(rbind(
  c('All Respondents',
    'SQL','R','Python','Excel','Hadoop','Java',
    'Network/Graph','JavaScript','Tableau','D3',
    'Mahout','Ruby','SAS/SPSS'),
  c(57,42,33,26,25,23,17,16,7,15,8,7,5,9),
  c(43,29,10,15,11,12,17,4,13,4,5,6,6,2)
)))

names(data.science.tools) <- c('DataTool', 'Data', 'NonData')

At this point, the results match up with what appeared in the chart from the O'Reilly report. The numbers represent the percentage of respondents who use the given tool.

data.science.tools
##           DataTool Data NonData
## 1  All Respondents   57      43
## 2              SQL   42      29
## 3                R   33      10
## 4           Python   26      15
## 5            Excel   25      11
## 6           Hadoop   23      12
## 7             Java   17      17
## 8    Network/Graph   16       4
## 9       JavaScript    7      13
## 10         Tableau   15       4
## 11              D3    8       5
## 12          Mahout    7       6
## 13            Ruby    5       6
## 14        SAS/SPSS    9       2
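As an aside, building the frame column by column with data.frame() would keep the counts numeric from the start - the t(rbind(...)) route above coerces everything to character, which is why a conversion step appears below:

# Equivalent construction that never coerces the counts to character
data.science.tools <- data.frame(
  DataTool = c('All Respondents', 'SQL', 'R', 'Python', 'Excel',
               'Hadoop', 'Java', 'Network/Graph', 'JavaScript',
               'Tableau', 'D3', 'Mahout', 'Ruby', 'SAS/SPSS'),
  Data     = c(57, 42, 33, 26, 25, 23, 17, 16, 7, 15, 8, 7, 5, 9),
  NonData  = c(43, 29, 10, 15, 11, 12, 17, 4, 13, 4, 5, 6, 6, 2)
)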

The data is easier to work with if reshaped into long format using melt.

data.science.tools.df <- melt(
  data.science.tools,
  id.vars = 'DataTool',
  variable.name = 'Role',
  value.name = 'Respondents'
)

The resulting data frame:

data.science.tools.df
##           DataTool    Role Respondents
## 1  All Respondents    Data          57
## 2              SQL    Data          42
## 3                R    Data          33
## 4           Python    Data          26
## 5            Excel    Data          25
## 6           Hadoop    Data          23
## 7             Java    Data          17
## 8    Network/Graph    Data          16
## 9       JavaScript    Data           7
## 10         Tableau    Data          15
## 11              D3    Data           8
## 12          Mahout    Data           7
## 13            Ruby    Data           5
## 14        SAS/SPSS    Data           9
## 15 All Respondents NonData          43
## 16             SQL NonData          29
## 17               R NonData          10
## 18          Python NonData          15
## 19           Excel NonData          11
## 20          Hadoop NonData          12
## 21            Java NonData          17
## 22   Network/Graph NonData           4
## 23      JavaScript NonData          13
## 24         Tableau NonData           4
## 25              D3 NonData           5
## 26          Mahout NonData           6
## 27            Ruby NonData           6
## 28        SAS/SPSS NonData           2

The Respondents column now needs to be converted to the required numeric type. Going through as.character first matters here: calling as.numeric() directly on a factor column would return the underlying level codes rather than the values themselves.

data.science.tools.df$Respondents <- as.numeric(
  as.character(data.science.tools.df$Respondents)
)
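A quick sanity check: the converted values should span the percentages in the table, from 2 (SAS/SPSS among non-data respondents) up to 57.

range(data.science.tools.df$Respondents)
## [1]  2 57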

Create the original chart:

ggplot(data = data.science.tools.df,
       aes(x = reorder(DataTool, Respondents, FUN = max),
           y = Respondents,
           fill = Role)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  theme(axis.title.y = element_blank())

(Plot: the reconstructed version of the original chart - stacked bars of tool usage by role)
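To keep a copy of the chart for a post like this one, ggsave() writes the most recently displayed plot to disk; the file name and dimensions here are arbitrary examples.

ggsave('tool-usage-stacked.png', width = 8, height = 6)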

Now for the faceted version:

ggplot(data = data.science.tools.df,
       aes(x = reorder(DataTool, Respondents, FUN = max),
           y = Respondents,
           fill = Role)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  facet_grid(. ~ Role) +
  theme(axis.title.y = element_blank())

(Plot: tool usage split into Data and NonData facets)
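For what it's worth, dodging the bars side by side within a single panel is an alternative to faceting - which reads better is largely a matter of taste:

ggplot(data = data.science.tools.df,
       aes(x = reorder(DataTool, Respondents, FUN = max),
           y = Respondents,
           fill = Role)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  coord_flip() +
  theme(axis.title.y = element_blank())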

Those in the non-data role appear to come largely from a more traditional software development/programming background. The top tool in use after SQL is Java, followed by Python and JavaScript. Hadoop, as a Java-based framework, is closely related. Excel is used more than R, which suggests a fascinating opportunity for R. Spreadsheets are and will remain useful, but anyone involved in data munging and analysis can benefit from R. As has been oft-trumpeted, scripted R programs are far more controlled and disciplined than clicking around in a spreadsheet, and they promote reproducible, less error-prone results. Ruby ranks a bit higher among the non-data users than among the data users, and SAS/SPSS usage is minimal, both of which also fit a programmer audience.

To get a closer look at the “NonData” role:

ggplot(data = data.science.tools.df[data.science.tools.df$Role == "NonData", ],
       aes(x = reorder(DataTool, Respondents, FUN = max),
           y = Respondents)) +
  geom_bar(stat = 'identity') +
  coord_flip() +
  theme(axis.title.y = element_blank())

(Plot: tool usage among NonData respondents only)
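Base R's subset() expresses the same filter a bit more readably; the non.data name is just illustrative, and the result can replace the bracket indexing in the call above.

# Same rows, selected with subset() instead of bracket indexing
non.data <- subset(data.science.tools.df, Role == 'NonData')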

A number of tools are conspicuously absent from the survey.

It would also be interesting to see related data about the respondents that undoubtedly impacts the results (mathematical proficiency, design abilities, typical data stores / database types accessed, and the typical audience for summarized data).

As I have been reviewing literature and educational resources on R, I am developing a stronger opinion that R, though a remarkable and powerful functional programming language, has not been presented well to a programming audience. Most introductions to R are more palatable to statisticians and others who have data analysis to complete but are not strongly aligned with programmer culture and expectations. The fact that so many R packages are in essence full-fledged DSLs further complicates R's presentation. As I mentioned in my previous post, Hadley's new book and RStudio are significant inroads that present R in a more programmer-friendly way. The involvement of programmers at the Strata Conference and similar events will increase its visibility and accessibility as well.