Note: this was originally an internal Stack Overflow tutorial. This analysis won’t work without the sqlstackr package and database access, meaning it’s only intended as an example of internal documentation.
This is a basic tutorial showing how to use R to request, manipulate, and visualize information from the Careers database. It uses a combination of:

- sqlstackr, an internal package for querying our databases
- dplyr, for data manipulation
- tidyr and stringr, for reshaping and string processing
- ggplot2, for visualization
This combination of packages is powerful for performing analyses and answering statistical questions about our data, especially involving statistical tests and graphics.
Let’s say I want to know some information about people’s CV Sections: that is, the jobs on their resume. I could come up with a SQL query something like this:
library(dplyr)
library(sqlstackr)
q <- "select Id, CVId, Name, Tags, StartYear, EndYear, IsCurrent from CVSections
where CVSectionTypeId = 2 and
StartYear is not null"
CV_sections <- query_Careers(q, collect = TRUE)
query_Careers is sqlstackr's function for sending a SQL query to the Careers database. (If you're wondering, there's also query_StackOverflow for core Q&A, query_HAProxyLogs for traffic data, query_Calculon for ad server data, and so on.) collect = TRUE tells the function to bring the results of the query into memory in R (otherwise it will just supply a preview).
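For example, if you skip collect = TRUE you get only a lazy preview. Assuming that preview behaves like a dplyr remote table (an assumption about this internal package, not something documented here), you can pull the full results into memory later with dplyr's collect():

preview <- query_Careers(q)       # lazy preview; nothing loaded into memory yet
CV_sections <- collect(preview)   # assumption: dplyr's collect() works on the preview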
Now, if you print this (by typing CV_sections into your console), you'll see the first few rows:
CV_sections
## Source: local data frame [540,310 x 7]
##
## Id CVId Name
## (int) (int) (chr)
## 1 2 1 Intern
## 2 6 3 Technical Evangelist
## 3 8 4 Marathon Runner
## 4 14 7 Chief Geek
## 5 18 9 Software Engineer
## 6 28 13 Software Engineer
## 7 32 15 Programmer
## 8 36 17 Lead Developer
## 9 38 18 Senior Electrical Project Engineer
## 10 40 19 .NET Developer
## .. ... ... ...
## Variables not shown: Tags (chr), StartYear (int), EndYear (int), IsCurrent
## (lgl)
This is a data frame, the most common data structure you'll work with in R. You can also view it like a spreadsheet by doing:
View(CV_sections)
Before we do any analysis, we usually have to do a bit of data-cleaning. We’ll do this with functions from the dplyr package.
One data-cleaning issue is that people put really wacky years for their start and end dates, like “1754” or “2110”. These will mess with our analyses and graphs, so let’s limit the data to a range that contains most of our valid resume items. We can do this with filter, which keeps only the rows of the table that satisfy a condition (like a “where” clause in SQL or LINQ):
cleaned <- filter(CV_sections, StartYear >= 1996, StartYear < 2016)
cleaned
## Source: local data frame [525,643 x 7]
##
## Id CVId Name
## (int) (int) (chr)
## 1 2 1 Intern
## 2 6 3 Technical Evangelist
## 3 8 4 Marathon Runner
## 4 14 7 Chief Geek
## 5 18 9 Software Engineer
## 6 32 15 Programmer
## 7 36 17 Lead Developer
## 8 38 18 Senior Electrical Project Engineer
## 9 40 19 .NET Developer
## 10 42 20 Student Mentor
## .. ... ... ...
## Variables not shown: Tags (chr), StartYear (int), EndYear (int), IsCurrent
## (lgl)
(We omit 2016 since it’s early in the year and results will be unusual.) Notice that before there were 540,310 rows, but now there are 525,643.
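You can check those row counts directly with base R’s nrow:

nrow(CV_sections)

## [1] 540310

nrow(cleaned)

## [1] 525643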
You should get into the habit of writing this function call a different way:
CV_sections %>%
filter(StartYear >= 1996, StartYear < 2016)
## Source: local data frame [525,643 x 7]
##
## Id CVId Name
## (int) (int) (chr)
## 1 2 1 Intern
## 2 6 3 Technical Evangelist
## 3 8 4 Marathon Runner
## 4 14 7 Chief Geek
## 5 18 9 Software Engineer
## 6 32 15 Programmer
## 7 36 17 Lead Developer
## 8 38 18 Senior Electrical Project Engineer
## 9 40 19 .NET Developer
## 10 42 20 Student Mentor
## .. ... ... ...
## Variables not shown: Tags (chr), StartYear (int), EndYear (int), IsCurrent
## (lgl)
This %>% comes with dplyr and is called the “pipe operator”: it simply means “insert the CV_sections object as the first argument of filter”. Why do this? Because it allows us to chain many dplyr operations together without nesting functions. You’ll see this become useful soon.
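To see the equivalence, these two calls produce exactly the same result:

filter(CV_sections, StartYear >= 1996, StartYear < 2016)

CV_sections %>%
    filter(StartYear >= 1996, StartYear < 2016)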
Another data-cleaning issue is that there is sometimes trailing or leading whitespace in the names. Also, capitalization is inconsistent: one person may write Software Engineer while another writes Software engineer. So let’s trim the whitespace and turn everything lowercase.
This involves adding another useful package, stringr (for string manipulation), and using another dplyr function, mutate, which alters a column of a data frame or adds a new one.
library(stringr)
cleaned <- CV_sections %>%
filter(StartYear >= 1996, StartYear < 2016) %>%
mutate(Name = str_trim(str_to_lower(Name)))
cleaned
## Source: local data frame [525,643 x 7]
##
## Id CVId Name
## (int) (int) (chr)
## 1 2 1 intern
## 2 6 3 technical evangelist
## 3 8 4 marathon runner
## 4 14 7 chief geek
## 5 18 9 software engineer
## 6 32 15 programmer
## 7 36 17 lead developer
## 8 38 18 senior electrical project engineer
## 9 40 19 .net developer
## 10 42 20 student mentor
## .. ... ... ...
## Variables not shown: Tags (chr), StartYear (int), EndYear (int), IsCurrent
## (lgl)
str_trim trims whitespace, and str_to_lower converts text to lowercase. You can see their combined effect on a single string:
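str_to_lower(str_trim("  Software Engineer "))

## [1] "software engineer"

From now on, we’ll be using the cleaned dataset in our analyses.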
Now that we have a clean dataset, let’s start with a simple question. What are the most common job titles? We can answer that with the count function from the dplyr package.
cleaned %>%
count(Name, sort = TRUE)
## Source: local data frame [132,879 x 2]
##
## Name n
## (chr) (int)
## 1 software engineer 35467
## 2 software developer 25624
## 3 web developer 19155
## 4 senior software engineer 12699
## 5 developer 11604
## 6 senior developer 5605
## 7 consultant 5128
## 8 programmer 5107
## 9 senior software developer 4808
## 10 intern 4119
## .. ... ...
This aggregates the data frame by the unique values of Name, sorting so that the most common come first. Notice that we counted the Name column without putting Name in quotes: dplyr functions refer to columns by their bare names.
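Incidentally, count is shorthand for a longer dplyr chain; this produces the same table:

cleaned %>%
    group_by(Name) %>%
    summarize(n = n()) %>%
    arrange(desc(n))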
Similarly, if we’d wanted to count the number of jobs started in each year, we could have done:
cleaned %>%
count(StartYear, sort = TRUE)
## Source: local data frame [20 x 2]
##
## StartYear n
## (int) (int)
## 1 2011 56673
## 2 2012 55948
## 3 2010 50253
## 4 2013 48187
## 5 2014 39541
## 6 2009 38607
## 7 2008 38480
## 8 2007 34795
## 9 2006 28114
## 10 2015 22945
## 11 2005 22944
## 12 2004 17676
## 13 2003 13443
## 14 2000 11661
## 15 2001 11092
## 16 2002 10971
## 17 1999 8435
## 18 1998 6690
## 19 1997 5208
## 20 1996 3980
In my 1/29/16 Tiny Talk, I did some analyses of how tags have grown or shrunk in popularity over the last twenty years. Let’s reproduce a simple version of that analysis.
First of all, we don’t need all the columns: just the CV section ID, the starting year, and the tags. This is done with the select function in dplyr:
cleaned %>%
select(Id, StartYear, Tags)
## Source: local data frame [525,643 x 3]
##
## Id StartYear
## (int) (int)
## 1 2 2001
## 2 6 2005
## 3 8 2004
## 4 14 2008
## 5 18 2008
## 6 32 2008
## 7 36 2000
## 8 38 2005
## 9 40 2007
## 10 42 2005
## .. ... ...
## Variables not shown: Tags (chr)
Notice that just like SQL’s SELECT, dplyr’s select extracts only specified columns from a dataset. Note also that some jobs have no tags, so for this analysis we filter them out:
cleaned %>%
select(Id, StartYear, Tags) %>%
filter(Tags != "")
## Source: local data frame [342,883 x 3]
##
## Id StartYear
## (int) (int)
## 1 2 2001
## 2 6 2005
## 3 8 2004
## 4 14 2008
## 5 18 2008
## 6 32 2008
## 7 36 2000
## 8 38 2005
## 9 40 2007
## 10 42 2005
## .. ... ...
## Variables not shown: Tags (chr)
Right now, all the tags are combined within each string, which makes it impossible to count them over time. We want to reorganize the data frame so that there’s one row per tag per CV. This requires the help of the tidyr package.
library(tidyr)
by_tag <- cleaned %>%
select(Id, StartYear, Tags) %>%
filter(Tags != "") %>%
mutate(Tags = str_split(Tags, " ")) %>%
unnest(Tags) %>%
rename(Tag = Tags)
by_tag
## Source: local data frame [2,159,204 x 3]
##
## Id StartYear Tag
## (int) (int) (chr)
## 1 2 2001 asp.net
## 2 2 2001 vb.net
## 3 2 2001 oracle
## 4 2 2001 html
## 5 2 2001 javascript
## 6 6 2005 silverlight
## 7 6 2005 c#
## 8 6 2005 asp.net
## 9 8 2004 insoles
## 10 14 2008 facebook
## .. ... ... ...
That pair of steps, mutate with str_split and then unnest, functions a little like LINQ’s SelectMany. We split each string into a vector around the " " character, then “unnested” them to make an even taller data frame (notice there are now 2,159,204 rows). The last thing I did was rename the column from Tags to Tag, since each row now represents a single tag.
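If the pattern is new to you, here’s a minimal illustration with made-up data (using the packages we’ve already loaded):

d <- data_frame(Id = 1:2, Tags = c("r dplyr", "sql"))
d %>%
    mutate(Tags = str_split(Tags, " ")) %>%
    unnest(Tags)

This gives three rows: two for the first Id (one per tag) and one for the second.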
We now have a data frame with one row per tag per CV, along with the starting year of that CV. For starters, we can simply ask what the most common tags on CVs are!
tag_counts <- by_tag %>%
count(Tag, sort = TRUE)
tag_counts
## Source: local data frame [43,719 x 2]
##
## Tag n
## (chr) (int)
## 1 javascript 90806
## 2 java 70615
## 3 php 63695
## 4 c# 62744
## 5 mysql 52205
## 6 html 49043
## 7 jquery 47043
## 8 css 46893
## 9 c++ 33510
## 10 python 33168
## .. ... ...
This isn’t so far from the list of most common tags on Stack Overflow; we’ll take a closer look at that comparison next.
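As an aside, the per-year analysis from the Tiny Talk starts from this same by_tag data frame. A minimal sketch counts each tag within each starting year (the talk’s full analysis presumably went further, e.g. normalizing by each year’s total):

by_tag %>%
    count(StartYear, Tag)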
Let’s move on to a different topic. Suppose we wanted to compare how common tags are on CVs as opposed to questions on Stack Overflow. Earlier we used query_Careers to send a SQL request to Careers. Similarly, we can use query_StackOverflow to send a SQL request to the Stack Overflow database:
SO_tags <- query_StackOverflow("select Name as Tag, Count as Questions from Tags",
collect = TRUE)
SO_tags
## Source: local data frame [43,940 x 2]
##
## Tag Questions
## (chr) (int)
## 1 .a 82
## 2 .app 92
## 3 .aspxauth 48
## 4 .bash-profile 390
## 5 .class-file 181
## 6 .cs-file 34
## 7 .doc 107
## 8 .emf 58
## 9 .git-info-grafts 3
## 10 .hgtags 7
## .. ... ...
Notice that in the SQL query I named the columns informatively as Tag (to match our other table) and Questions. We now have one table showing the number of SO questions per tag, and another showing appearances on CVs. We can use dplyr’s inner_join to combine them, much like a SQL JOIN statement:
tag_counts_joined <- tag_counts %>%
rename(CVTags = n) %>%
inner_join(SO_tags, by = "Tag")
tag_counts_joined
## Source: local data frame [16,435 x 3]
##
## Tag CVTags Questions
## (chr) (int) (int)
## 1 javascript 90806 1046896
## 2 java 70615 1009502
## 3 php 63695 872316
## 4 c# 62744 907234
## 5 mysql 52205 375010
## 6 html 49043 502082
## 7 jquery 47043 703534
## 8 css 46893 366481
## 9 c++ 33510 425925
## 10 python 33168 530805
## .. ... ... ...
(Note that I renamed n to CVTags before joining, to make the comparison table more informative.)
We’ve again reached a step where looking at the first few rows of a table isn’t enough to draw conclusions. That’s when it’s time to start plotting.
library(ggplot2)
ggplot(tag_counts_joined, aes(x = Questions, y = CVTags)) +
geom_point()
[Plot: scatterplot of Questions (x-axis) vs. CVTags (y-axis)]
Just like that, we have a scatterplot of number of SO questions vs number of CV tags. But we don’t know which point is which. So let’s add a second layer to the plot, a geom_text layer:
ggplot(tag_counts_joined, aes(x = Questions, y = CVTags)) +
geom_point() +
geom_text(aes(label = Tag), vjust = 1, hjust = 1, check_overlap = TRUE)
[Plot: the same scatterplot, now with tag labels]
Notice that we defined a new aesthetic mapping for this layer: the label on each point is based on the Tag column of the data. We also set three options: vjust and hjust tell it to place each label slightly below and to the left of its point (rather than on top of it), and check_overlap = TRUE tells it not to add two labels where they would overlap.
One problem with this graph is that so much of the interesting stuff is crammed into the lower left corner. So we make another change.
ggplot(tag_counts_joined, aes(Questions, CVTags)) +
geom_point() +
geom_text(aes(label = Tag), vjust = 1, hjust = 1, check_overlap = TRUE) +
scale_x_log10() +
scale_y_log10()
[Plot: the labeled scatterplot with both axes on log scales]
scale_x_log10() and scale_y_log10() are not extra layers; they are scales that change how the data is mapped onto the plot. In this case, they put both x and y on a log scale.
There’s still a lot of this plot taken up by points and tags we don’t care about: ones with only a handful of questions or CV appearances. So we keep only the tags with more than 1,000 SO questions or more than 1,000 CV appearances:
tag_counts_common <- tag_counts_joined %>%
filter(Questions > 1000 | CVTags > 1000)
ggplot(tag_counts_common, aes(Questions, CVTags)) +
geom_point() +
geom_text(aes(label = Tag), vjust = 1, hjust = 1, check_overlap = TRUE) +
scale_x_log10() +
scale_y_log10()
[Plot: labeled scatterplot of the remaining common tags, log-log scales]
There are some insights we can gain from this. As one example, there are far more r questions than amazon-web-services questions on SO, but more amazon-web-services tags on CVs. (I suspect this is because SO is often visited by academics and scientists, while Careers CVs come mostly from software developers.) Some project management techniques like scrum have many appearances in jobs but very few SO questions, while technical tags like file, string, and list have many questions but almost no CV appearances.
One question I get from developers is “Why use R to analyze databases when you can just use SQL queries?” Notice, indeed, that much of the dplyr syntax mirrors SQL operations.
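To make that mirroring concrete, the count call from earlier could have been expressed as SQL sent through query_Careers (roughly equivalent; note our R version also trimmed and lowercased the names first):

query_Careers("select Name, count(*) as n
               from CVSections
               group by Name
               order by n desc", collect = TRUE)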
Well, here are some ways this analysis was possible only in R:
- The unnest step to separate tags in CVSections. This kind of string parsing and reshaping would have been quite challenging to do in SQL alone, but was absolutely necessary to start counting tags per year.
- The inner_join step, since the Stack Overflow and Careers data live on different databases and even servers, so joining them in SQL is not straightforward. And this is just the start of the data integration we can do: we could even have brought in JSON or XML information that could never plausibly have made it into a SQL query.
- We didn’t even start using the statistical analysis tools, including modeling and machine learning methods, that are present in R.