Exploring Careers Data with sqlstackr, dplyr, and ggplot2

Note: this was originally an internal Stack Overflow tutorial. This analysis won’t work without the sqlstackr package and database access, meaning it’s only intended as an example of internal documentation.

This is a basic tutorial showing how to use R to request, manipulate, and visualize information from the Careers database. This uses a combination of:

This combination of packages is powerful for performing analyses and answering statistical questions about our data, especially involving statistical tests and graphics.

Getting data from Careers

Let’s say I want to know some information about people’s CV Sections: that is, the jobs on their resume. I could come up with a SQL query something like this:

library(dplyr)
library(sqlstackr)

q <- "select Id, CVId, Name, Tags, StartYear, EndYear, IsCurrent from CVSections
        where CVSectionTypeId = 2 and
        StartYear is not null"

CV_sections <- query_Careers(q, collect = TRUE)

This query_Careers function is sqlstackr’s function for sending a SQL query to the Careers database. (If you’re wondering, there’s also query_StackOverflow for core Q&A, query_HAProxyLogs for traffic data, query_Calculon for ad server data, and so on). collect = TRUE tells the function to bring the results of the query in memory into R (otherwise it will just supply a preview).

Now, if you print this (by putting CV_sections into your terminal) you’ll see the first few rows:

CV_sections
## Source: local data frame [540,310 x 7]
## 
##       Id  CVId                               Name
##    (int) (int)                              (chr)
## 1      2     1                             Intern
## 2      6     3               Technical Evangelist
## 3      8     4                    Marathon Runner
## 4     14     7                         Chief Geek
## 5     18     9                  Software Engineer
## 6     28    13                  Software Engineer
## 7     32    15                         Programmer
## 8     36    17                     Lead Developer
## 9     38    18 Senior Electrical Project Engineer
## 10    40    19                     .NET Developer
## ..   ...   ...                                ...
## Variables not shown: Tags (chr), StartYear (int), EndYear (int), IsCurrent
##   (lgl)

This is a data frame, which is the most common data type in R you’ll be working with. You can also view it like an spreadsheet by doing:

View(CV_sections)

Data cleaning with filter and mutate

Before we do any analysis, we usually have to do a bit of data-cleaning. We’ll do this with functions from the dplyr package.

One data cleaning issue is that people put really wacky years for their Start and End dates, like “1754” or “2110”. This will mess with our analyses and graphs. Let’s limit it to a range where most of our valid resume items are. We can do this with filter, which filters for particular rows of the table based on a condition (like a “where” statement in SQL or LINQ):

cleaned <- filter(CV_sections, StartYear >= 1996, StartYear < 2016)
cleaned
## Source: local data frame [525,643 x 7]
## 
##       Id  CVId                               Name
##    (int) (int)                              (chr)
## 1      2     1                             Intern
## 2      6     3               Technical Evangelist
## 3      8     4                    Marathon Runner
## 4     14     7                         Chief Geek
## 5     18     9                  Software Engineer
## 6     32    15                         Programmer
## 7     36    17                     Lead Developer
## 8     38    18 Senior Electrical Project Engineer
## 9     40    19                     .NET Developer
## 10    42    20                     Student Mentor
## ..   ...   ...                                ...
## Variables not shown: Tags (chr), StartYear (int), EndYear (int), IsCurrent
##   (lgl)

(We omit 2016 since it’s early in the year and results will be unusual). Notice that before there were 540310 rows, but now there are 525643 rows.

You should get into the habit of writing this function call a different way:

CV_sections %>%
  filter(StartYear >= 1996, StartYear < 2016)
## Source: local data frame [525,643 x 7]
## 
##       Id  CVId                               Name
##    (int) (int)                              (chr)
## 1      2     1                             Intern
## 2      6     3               Technical Evangelist
## 3      8     4                    Marathon Runner
## 4     14     7                         Chief Geek
## 5     18     9                  Software Engineer
## 6     32    15                         Programmer
## 7     36    17                     Lead Developer
## 8     38    18 Senior Electrical Project Engineer
## 9     40    19                     .NET Developer
## 10    42    20                     Student Mentor
## ..   ...   ...                                ...
## Variables not shown: Tags (chr), StartYear (int), EndYear (int), IsCurrent
##   (lgl)

This %>% comes with dplyr and is called a “pipe operator”: it simply means “insert the CV_sections object as the first argument of count. Why do this? Because it allows us to chain many dplyr operations together: without having nested functions. You’ll see this become useful soon.

Another data-cleaning issue is that there is sometimes trailing or leading whitespace in the names, Also, there’s inconsistent capitalization: someone may write Software Engineer while others write Software engineer. So let’s trim the whitespace and turn everything lowercase.

This involves adding another useful package, stringr (for string manipulations), and using another dplyr function, mutate. mutate alters a column of a data frame, or adds a new one.

library(stringr)

cleaned <- CV_sections %>%
  filter(StartYear >= 1996, StartYear < 2016) %>%
  mutate(Name = str_trim(str_to_lower(Name)))

cleaned
## Source: local data frame [525,643 x 7]
## 
##       Id  CVId                               Name
##    (int) (int)                              (chr)
## 1      2     1                             intern
## 2      6     3               technical evangelist
## 3      8     4                    marathon runner
## 4     14     7                         chief geek
## 5     18     9                  software engineer
## 6     32    15                         programmer
## 7     36    17                     lead developer
## 8     38    18 senior electrical project engineer
## 9     40    19                     .net developer
## 10    42    20                     student mentor
## ..   ...   ...                                ...
## Variables not shown: Tags (chr), StartYear (int), EndYear (int), IsCurrent
##   (lgl)

str_trim trims whitespace, and str_to_lower turns it lowercase. From now on, we’ll be using the cleaned dataset in our analyses.

Summarizing data with count and group_by/summarize

Now that we have a clean dataset, let’s start with a simple question. What are the most common job titles? We can answer that with the count function from the dplyr package.

cleaned %>%
  count(Name, sort = TRUE)
## Source: local data frame [132,879 x 2]
## 
##                         Name     n
##                        (chr) (int)
## 1          software engineer 35467
## 2         software developer 25624
## 3              web developer 19155
## 4   senior software engineer 12699
## 5                  developer 11604
## 6           senior developer  5605
## 7                 consultant  5128
## 8                 programmer  5107
## 9  senior software developer  4808
## 10                    intern  4119
## ..                       ...   ...

This aggregates the data frame into the unique values of Name, sorting so that the most common are first. Notice that we wanted to count the Name column, but we didn’t put Name in quotes.

Similarly, if we’d wanted to count the number of jobs started in each year, we could have done:

cleaned %>%
  count(StartYear, sort = TRUE)
## Source: local data frame [20 x 2]
## 
##    StartYear     n
##        (int) (int)
## 1       2011 56673
## 2       2012 55948
## 3       2010 50253
## 4       2013 48187
## 5       2014 39541
## 6       2009 38607
## 7       2008 38480
## 8       2007 34795
## 9       2006 28114
## 10      2015 22945
## 11      2005 22944
## 12      2004 17676
## 13      2003 13443
## 14      2000 11661
## 15      2001 11092
## 16      2002 10971
## 17      1999  8435
## 18      1998  6690
## 19      1997  5208
## 20      1996  3980

Breaking up tag data

In my 1/29/16 Tiny Talk, I did some analyses of how tags have grown or shrunk in popularity over the last twenty years. Let’s reproduce a simple version of that analysis.

First of all, you don’t need all the columns, just the CV section ID, the starting year, and the vector of tags. This is done with the select function in dplyr:

cleaned %>%
  select(Id, StartYear, Tags)
## Source: local data frame [525,643 x 3]
## 
##       Id StartYear
##    (int)     (int)
## 1      2      2001
## 2      6      2005
## 3      8      2004
## 4     14      2008
## 5     18      2008
## 6     32      2008
## 7     36      2000
## 8     38      2005
## 9     40      2007
## 10    42      2005
## ..   ...       ...
## Variables not shown: Tags (chr)

Notice that just like SQL’s SELECT, dplyr’s select extracts only specified columns from a dataset. Note also that some jobs have no tags, so for this analysis we filter them out:

cleaned %>%
  select(Id, StartYear, Tags) %>%
  filter(Tags != "")
## Source: local data frame [342,883 x 3]
## 
##       Id StartYear
##    (int)     (int)
## 1      2      2001
## 2      6      2005
## 3      8      2004
## 4     14      2008
## 5     18      2008
## 6     32      2008
## 7     36      2000
## 8     38      2005
## 9     40      2007
## 10    42      2005
## ..   ...       ...
## Variables not shown: Tags (chr)

Now, right now all the tags are combined together within each string, which makes it impossible to count them over time. We want to reorganize the data frame so that there’s one-row-per-tag-per-cv. This requires the help of the tidyr package.

library(tidyr)

by_tag <- cleaned %>%
  select(Id, StartYear, Tags) %>%
  filter(Tags != "") %>%
  mutate(Tags = str_split(Tags, " ")) %>%
  unnest(Tags) %>%
  rename(Tag = Tags)

by_tag
## Source: local data frame [2,159,204 x 3]
## 
##       Id StartYear         Tag
##    (int)     (int)       (chr)
## 1      2      2001     asp.net
## 2      2      2001      vb.net
## 3      2      2001      oracle
## 4      2      2001        html
## 5      2      2001  javascript
## 6      6      2005 silverlight
## 7      6      2005          c#
## 8      6      2005     asp.net
## 9      8      2004     insoles
## 10    14      2008    facebook
## ..   ...       ...         ...

That pair of steps- mutate/str_split and then unnest- function is a little like LINQ’s SelectMany. We split each string into a vector around the " " character, then we “unnested” them to make an even taller data frame (notice there are now 2159204 rows). The last thing I did is rename the column from Tags to Tag since it is no longer plural.

We now have a data frame that has one-row-per-tag-per-CV, along with the starting year of that CV. For starters, we can simply ask what the most common tags on CVs are!

tag_counts <- by_tag %>%
  count(Tag, sort = TRUE)

tag_counts
## Source: local data frame [43,719 x 2]
## 
##           Tag     n
##         (chr) (int)
## 1  javascript 90806
## 2        java 70615
## 3         php 63695
## 4          c# 62744
## 5       mysql 52205
## 6        html 49043
## 7      jquery 47043
## 8         css 46893
## 9         c++ 33510
## 10     python 33168
## ..        ...   ...

This isn’t so far from the most common tags on Stack Overflow. We’ll take a look into that later.

Tags fraction in year

tags_by_year <- by_tag %>%
  count(StartYear, Tag)

tags_by_year
## Source: local data frame [142,520 x 3]
## Groups: StartYear [?]
## 
##    StartYear           Tag     n
##        (int)         (chr) (int)
## 1       1996          .net    44
## 2       1996      .net-1.0     1
## 3       1996      .net-2.0     1
## 4       1996 .net-remoting     1
## 5       1996        .netcf     1
## 6       1996           1.0     1
## 7       1996           1.1     1
## 8       1996        1.x2.x     1
## 9       1996             2     2
## 10      1996           2.0     1
## ..       ...           ...   ...

Knowing that there were 44 jobs starting in 1996 with the .net tag is interesting, but it’s not what we really want to view. We want to know the fraction of jobs that year that contained this tag.

This means we’ll have to do a bit more interesting summarization, using group_by. This works a little like SQL’s GROUP BY. If we follow it with a mutate, the mutate will occur once for each unique group:

by_tag %>%
  group_by(StartYear) %>%
  mutate(YearTotal = n_distinct(Id))
## Source: local data frame [2,159,204 x 4]
## Groups: StartYear [20]
## 
##       Id StartYear         Tag YearTotal
##    (int)     (int)       (chr)     (int)
## 1      2      2001     asp.net      5951
## 2      2      2001      vb.net      5951
## 3      2      2001      oracle      5951
## 4      2      2001        html      5951
## 5      2      2001  javascript      5951
## 6      6      2005 silverlight     13088
## 7      6      2005          c#     13088
## 8      6      2005     asp.net     13088
## 9      8      2004     insoles      9719
## 10    14      2008    facebook     23540
## ..   ...       ...         ...       ...

The n_distinct function tells how many distinct values of CVId are in each group (each year). That’s the data we’ll need, so we count from there:

tags_by_year <- by_tag %>%
  group_by(StartYear) %>%
  mutate(YearTotal = n_distinct(Id)) %>%
  count(StartYear, YearTotal, Tag) %>%
  mutate(Percent = n / YearTotal)

tags_by_year
## Source: local data frame [142,520 x 5]
## Groups: StartYear, YearTotal [20]
## 
##    StartYear YearTotal           Tag     n      Percent
##        (int)     (int)         (chr) (int)        (dbl)
## 1       1996      1979          .net    44 0.0222334512
## 2       1996      1979      .net-1.0     1 0.0005053057
## 3       1996      1979      .net-2.0     1 0.0005053057
## 4       1996      1979 .net-remoting     1 0.0005053057
## 5       1996      1979        .netcf     1 0.0005053057
## 6       1996      1979           1.0     1 0.0005053057
## 7       1996      1979           1.1     1 0.0005053057
## 8       1996      1979        1.x2.x     1 0.0005053057
## 9       1996      1979             2     2 0.0010106114
## 10      1996      1979           2.0     1 0.0005053057
## ..       ...       ...           ...   ...          ...

Now we have the percentage of jobs in each year with that tag. Now we can actually make some graphs.

Visualizing tags over time

This is far too much data to look at manually. But we can start by picking a programming language, and filtering just for it.

js_tag <- tags_by_year %>%
  filter(Tag == "javascript")

js_tag
## Source: local data frame [20 x 5]
## Groups: StartYear, YearTotal [20]
## 
##    StartYear YearTotal        Tag     n   Percent
##        (int)     (int)      (chr) (int)     (dbl)
## 1       1996      1979 javascript   198 0.1000505
## 2       1997      2601 javascript   314 0.1207228
## 3       1998      3438 javascript   512 0.1489238
## 4       1999      4342 javascript   800 0.1842469
## 5       2000      6142 javascript  1244 0.2025399
## 6       2001      5951 javascript  1155 0.1940850
## 7       2002      5912 javascript  1112 0.1880920
## 8       2003      7283 javascript  1422 0.1952492
## 9       2004      9719 javascript  1946 0.2002264
## 10      2005     13088 javascript  2808 0.2145477
## 11      2006     16259 javascript  3738 0.2299034
## 12      2007     20873 javascript  5079 0.2433287
## 13      2008     23540 javascript  6026 0.2559898
## 14      2009     23928 javascript  6252 0.2612839
## 15      2010     32411 javascript  8641 0.2666070
## 16      2011     38046 javascript 10776 0.2832361
## 17      2012     39976 javascript 12049 0.3014058
## 18      2013     36483 javascript 11402 0.3125291
## 19      2014     31405 javascript  9560 0.3044101
## 20      2015     19507 javascript  5772 0.2958938

Looking at this manually we can start to see a pattern. But it is much easier to visualize it. To do this, we use ggplot2.

library(ggplot2)
ggplot(js_tag, aes(x = StartYear, y = Percent)) +
  geom_line()
plot of chunk unnamed-chunk-14

plot of chunk unnamed-chunk-14

There are three parts to a ggplot2 call:

  • data being plotted: in this case, the js_tag data frame we had just created
  • mapping of attributes in the data to aesthetic elements the graph: this is defined by aes(x = StartYear, y = Percent). This tells the graph the variables from the data frame that we want on the x and y axes.
  • layers: in this case, we add (+) a geom_point() layer. This tells it we want a line plot, which is appropriate for showing a changing trend over time. If we had instead wanted a scatter plot, we could have done:
ggplot(js_tag, aes(x = StartYear, y = Percent)) +
  geom_point()
plot of chunk unnamed-chunk-15

plot of chunk unnamed-chunk-15

Or a bar plot with:

ggplot(js_tag, aes(x = StartYear, y = Percent)) +
  geom_bar(stat = "identity")
plot of chunk unnamed-chunk-16

plot of chunk unnamed-chunk-16

(Don’t worry about stat = "identity" for now, it’s just telling ggplot2 that we want a bar plot rather than a histogram). That’s just plotting one single tag. We could filter for multiple tags using the %in% operator:

multiple_tags <- tags_by_year %>%
  filter(Tag %in% c("javascript", "python", "java", "c++"))

multiple_tags
## Source: local data frame [80 x 5]
## Groups: StartYear, YearTotal [20]
## 
##    StartYear YearTotal        Tag     n    Percent
##        (int)     (int)      (chr) (int)      (dbl)
## 1       1996      1979        c++   560 0.28297120
## 2       1996      1979       java   329 0.16624558
## 3       1996      1979 javascript   198 0.10005053
## 4       1996      1979     python    43 0.02172815
## 5       1997      2601        c++   681 0.26182238
## 6       1997      2601       java   412 0.15840062
## 7       1997      2601 javascript   314 0.12072280
## 8       1997      2601     python    62 0.02383699
## 9       1998      3438        c++   801 0.23298429
## 10      1998      3438       java   626 0.18208261
## ..       ...       ...        ...   ...        ...

(c("javascript", "python"...) defines a character vector, since c stands for “combine”). If we want to plot this, we have to separate these four tags from each other in some way. Let’s choose to distinguish them with color:

ggplot(multiple_tags, aes(StartYear, Percent, color = Tag)) +
  geom_line()
plot of chunk unnamed-chunk-17

plot of chunk unnamed-chunk-17

Suppose instead of just choosing four tags, we want to plot a bunch of the most common- say, all the ones that appear on more than 20,000 jobs.

common_tags <- tags_by_year %>%
  group_by(Tag) %>%
  mutate(TagTotal = sum(n)) %>%
  ungroup() %>%
  filter(TagTotal > 20000)

common_tags
## Source: local data frame [320 x 6]
## 
##    StartYear YearTotal        Tag     n     Percent TagTotal
##        (int)     (int)      (chr) (int)       (dbl)    (int)
## 1       1996      1979       .net    44 0.022233451    22296
## 2       1996      1979    asp.net    46 0.023244063    30008
## 3       1996      1979         c#   113 0.057099545    62744
## 4       1996      1979        c++   560 0.282971198    33510
## 5       1996      1979        css    87 0.043961597    46893
## 6       1996      1979       html   246 0.124305205    49043
## 7       1996      1979      html5    15 0.007579586    20090
## 8       1996      1979       java   329 0.166245579    70615
## 9       1996      1979 javascript   198 0.100050531    90806
## 10      1996      1979     jquery    47 0.023749368    47043
## ..       ...       ...        ...   ...         ...      ...
ggplot(common_tags, aes(StartYear, Percent, color = Tag)) +
  geom_line()
plot of chunk unnamed-chunk-18

plot of chunk unnamed-chunk-18

The color legend was useful when we had four lines, but now they’re almost impossible to tell apart, and we have to make a different visualization choice. Let’s instead create a sub-graph for each tag.

ggplot(common_tags, aes(StartYear, Percent)) +
  geom_line() +
  facet_wrap(~Tag)