1 Which are the most valued data science skills?

In this project, we try to answer this question by examining which words describing data science skills appear most often in several data science sources. In essence, we will use word frequency as a proxy measure for the value of data science skills: the higher the word frequency, the more valued the skill. This approach relies on three key assumptions:

  • The sources that we examine are representative of the data science field.
  • Data science skills that are more highly valued will be discussed more often in these sources.
  • We can correctly map from the words that are used in our sources to the distinct skills described by the words.

These assumptions are worth highlighting, since we know they are not entirely correct. First, the sources we choose may bias or skew our results; for instance, a book on programming will tend to contain many words relating to technical skills and analytics rather than “soft skills” such as teamwork and communication. Second, there are other measures besides word count, such as salary information on job postings, that could serve as proxy measures; these alternatives offer a different view and may well give different results. Third, the mapping from individual words to specific data science skills is subjective and open to interpretation; a single word may also describe different skills depending on the context.

So to address the main question of our project, we start by answering a more specific question:
“Which words relating to data science skills occur most frequently in a couple of representative data science sources?”

Once we have the answer to this specific question, we will see how that can be applied to answer the general question.

2 Working Approach

2.1 Team roles and collaboration

We worked as a two-person team over the course of a week.

  • Roles: We both contributed to the project, and jointly discussed ideas and decided on approach. Our division of labor ended up as:
    • Anne focused on methods to scrape possible data sources and load the data
    • Kevin focused on the word frequency analysis and the write-up
  • Communication: We generally touched base daily by telephone or Slack, in order to update each other on progress made, challenges, and next steps.

  • Collaboration tools: We shared R code, text files, and an R Markdown file using a project repository set up on GitHub, at https://github.com/aschwenker/project3.

2.2 Problem-solving approach

Once we decided on the overall approach, we did some preliminary research online to see which tools and methods would be most effective for text mining and word frequency analysis. We reviewed several options and found that we would need several packages to implement our approach:

  • Loading the data
    • tidyRSS package: to download an RSS feed and extract the data
    • readr package: to download a file from the internet
    • pdftools package: to extract text from a PDF file
  • Text mining the data
    • tm package: to tokenize a character vector and remove stop words
    • tau package: to count word patterns from character vectors
  • Transforming and presenting the results
    • plyr and dplyr packages: to reshape and manipulate the word counts
    • knitr package: to display results tables
    • ggplot2 package: to visualize the word frequencies
# load packages
library(tidyRSS)
library(readr)
library(pdftools)
library(tm)
library(tau)
library(plyr)
library(dplyr)
library(knitr)
library(ggplot2)

3 Data Collection

We decided to use two sources relating to data science for the word frequency analysis:

  • A Google Alert RSS feed on data science skills
  • The textbook Automated Data Collection with R (PDF)

Our logic was that the first source would represent stories in the general media about data science skills, while the second source would give a more academic or technical perspective.

3.1 RSS feed

In order to access the RSS feed, we used the tidyRSS::tidyfeed function to read in the data.

# read in google alert RSS feed
data_science_skills <- tidyfeed("https://www.google.com/alerts/feeds/00182648300908928214/18036739630504927351", sf = TRUE)
names(data_science_skills)
## [1] "feed_title"        "feed_link"         "feed_last_updated"
## [4] "item_title"        "item_date_updated" "item_link"        
## [7] "item_content"
data_science_skills$item_link
## [1] "tag:google.com,2013:googlealerts/feed:17356835139618477013"
## [2] "tag:google.com,2013:googlealerts/feed:2436643227745363368" 
## [3] "tag:google.com,2013:googlealerts/feed:6627674387286412333" 
## [4] "tag:google.com,2013:googlealerts/feed:877245457118443212"  
## [5] "tag:google.com,2013:googlealerts/feed:8818965960477480591" 
## [6] "tag:google.com,2013:googlealerts/feed:18031683177660808970"
data_science_skills$item_content
## [1] "To remain a relevant sector in the Fourth Industrial Revolution (FIRe), persons with disabilities (PWDs) must equip themselves with <b>science</b> and&nbsp;..."                             
## [2] "Developers want to learn the <b>data sciences</b>. They see machine learning and <b>data</b> <b>science</b> as the most important <b>skill</b> they need to learn in the year&nbsp;..."      
## [3] "Our graduate diploma in <b>data science</b> is designed for graduates wishing to broaden their <b>skill</b>-set enabling, them to become competent and confident&nbsp;..."                   
## [4] "New or upgraded <b>skills</b> are needed in many key areas of work, especially those developing digital capabilities, the use of <b>data</b> and the <b>science</b> of <b>data</b>,&nbsp;..."
## [5] "We have an opportunity for you to use your <b>analytical skills</b> to improve the Department of Defense&#39;s management of multiple <b>data</b> sources. You&#39;ll work&nbsp;..."         
## [6] "We have an opportunity for you to use your leadership and <b>analytical skills</b> to improve the Department of Defense&#39;s management of multiple <b>data</b>&nbsp;..."
# load story summaries
title_vector <- c(t(data_science_skills$item_content))
title_vector
## [1] "To remain a relevant sector in the Fourth Industrial Revolution (FIRe), persons with disabilities (PWDs) must equip themselves with <b>science</b> and&nbsp;..."                             
## [2] "Developers want to learn the <b>data sciences</b>. They see machine learning and <b>data</b> <b>science</b> as the most important <b>skill</b> they need to learn in the year&nbsp;..."      
## [3] "Our graduate diploma in <b>data science</b> is designed for graduates wishing to broaden their <b>skill</b>-set enabling, them to become competent and confident&nbsp;..."                   
## [4] "New or upgraded <b>skills</b> are needed in many key areas of work, especially those developing digital capabilities, the use of <b>data</b> and the <b>science</b> of <b>data</b>,&nbsp;..."
## [5] "We have an opportunity for you to use your <b>analytical skills</b> to improve the Department of Defense&#39;s management of multiple <b>data</b> sources. You&#39;ll work&nbsp;..."         
## [6] "We have an opportunity for you to use your leadership and <b>analytical skills</b> to improve the Department of Defense&#39;s management of multiple <b>data</b>&nbsp;..."

Note that the content of the RSS feed only includes story summaries (truncated after approximately 150-180 characters), so this will limit our word counts.
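As a quick check on that truncation (a minimal sketch; the exact cut-off varies by item), we can strip the HTML markup from each summary and count the remaining characters:

# strip HTML tags and entities, then measure the length of each summary
clean_summaries <- gsub("<[^>]+>|&[#a-zA-Z0-9]+;", " ", title_vector)
nchar(clean_summaries)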

Now that we have a collection of text data, we can use the tm and tau packages to do the word count. We adapted code we found online (https://www.codementor.io/jhwatts2010/counting-words-with-r-ds35hzgmj) for this part of the analysis. In counting word frequencies, we want to focus on “content” words that convey meaning, rather than function words such as pronouns, prepositions, and articles. To help with this, tm offers a common set of “stop words” that can be excluded from the word count.

# flag: whether to exclude stop words from the word count
stop_words <- TRUE
# show a random sample of 100 of the stop words that will be excluded
sort(sample(tm::stopwords("SMART"), 100))
##   [1] "anyhow"       "apart"        "appropriate"  "at"          
##   [5] "because"      "before"       "behind"       "besides"     
##   [9] "between"      "beyond"       "both"         "c'mon"       
##  [13] "causes"       "certainly"    "clearly"      "co"          
##  [17] "com"          "comes"        "consequently" "during"      
##  [21] "edu"          "eg"           "etc"          "except"      
##  [25] "following"    "forth"        "go"           "goes"        
##  [29] "gotten"       "h"            "happens"      "has"         
##  [33] "hence"        "himself"      "his"          "i"           
##  [37] "i'd"          "i'll"         "inasmuch"     "inc"         
##  [41] "inward"       "is"           "isn't"        "it's"        
##  [45] "its"          "keep"         "kept"         "lately"      
##  [49] "later"        "latter"       "least"        "liked"       
##  [53] "likely"       "many"         "more"         "much"        
##  [57] "nd"           "need"         "needs"        "new"         
##  [61] "no"           "nobody"       "none"         "novel"       
##  [65] "nowhere"      "outside"      "p"            "quite"       
##  [69] "rather"       "reasonably"   "relatively"   "respectively"
##  [73] "said"         "says"         "secondly"     "seeing"      
##  [77] "soon"         "t"            "t's"          "tends"       
##  [81] "than"         "thank"        "thats"        "the"         
##  [85] "thereby"      "theres"       "though"       "truly"       
##  [89] "up"           "upon"         "was"          "went"        
##  [93] "weren't"      "what"         "whatever"     "where"       
##  [97] "with"         "yes"          "you've"       "z"

Next we use the tm::scan_tokenizer and tau::textcnt functions to build up the word count, after excluding the stop words.

# tokenize the text
data <- tm::scan_tokenizer(title_vector)
# remove stop words (only when the stop_words flag is set)
if (stop_words) data <- tm::removeWords(data, tm::stopwords("SMART"))
# count words
data1 <- tau::textcnt(data, method = "string", n = 1L, lower = 1L)
str(data1)
##  'textcnt' Named int [1:15] 2 7 2 2 2 2 2 2 6 2 ...
##  - attr(*, "names")= chr [1:15] "analytical" "data" "defense" "department" ...
##  - attr(*, "useBytes")= logi FALSE
summary(data1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0     2.0     2.0     2.8     2.5     7.0

3.2 Textbook

First we downloaded the PDF file of the book from the internet and saved it locally in order to do the analysis. Given the size of the book and time required to download it, we’ve commented out the download.file command in the code below. We also tried to save the PDF file in GitHub and download from there, but we encountered some difficulties, probably relating to the security protocol.

# download PDF of book from open website
# download.file("http://kek.ksu.ru/eos/WM/AutDataCollectR.pdf", "./AutDataCollectR.pdf")
text <- pdf_text("./AutDataCollectR.pdf")

# OR try loading PDF from GitHub
# but this doesn't seem to work
# url <- "https://github.com/aschwenker/project3/blob/master/AutDataCollectR.pdf"
# text <- pdf_text(getURL(url))
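One possible workaround, which we did not get to test, is to request the raw file from GitHub rather than the repository page: the blob URL above returns an HTML page rather than the PDF itself, so there is nothing for pdf_text to parse. The raw-file URL below is our best guess at the path and is left commented out as untested.

# untested sketch: point pdftools at the raw file rather than the GitHub page (URL assumed)
# raw_url <- "https://raw.githubusercontent.com/aschwenker/project3/master/AutDataCollectR.pdf"
# text <- pdf_text(raw_url)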

Next we apply the same procedure as above to tokenize the data and build up the word count, after excluding the stop words.

# tokenize the data
data <- tm::scan_tokenizer(text)
# remove stop words (only when the stop_words flag is set)
if (stop_words) data <- tm::removeWords(data, tm::stopwords("SMART"))
# count words
data2 <- tau::textcnt(data, method = "string", n = 1L, lower = 1L)
str(data2)
##  'textcnt' Named int [1:5990] 2 23 2 4 3 3 3 4 4 3 ...
##  - attr(*, "names")= chr [1:5990] "*" "-" "<U+2192>" "<U+22EE>" ...
##  - attr(*, "useBytes")= logi FALSE
summary(data2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    2.00    5.00   15.01   12.00 1589.00

4 Data Tidying and Transformations

4.1 RSS feed

Upon reviewing the data, we notice that the result of the word count processing is a textcnt object (a named integer vector), not a data frame. We use the ldply function to transform data1 into a data frame of word counts, rename the columns, and sort by descending frequency.

# transform data into data frame and rename columns
Results1 <- data1 %>% ldply(data.frame) %>% rename("Word" = 1, "Frequency" = 2) %>% arrange(desc(Frequency))
Results1
##           Word Frequency
## 1         data         7
## 2         nbsp         6
## 3      science         4
## 4       skills         3
## 5   analytical         2
## 6      defense         2
## 7   department         2
## 8      improve         2
## 9        learn         2
## 10  management         2
## 11    multiple         2
## 12 opportunity         2
## 13       skill         2
## 14          we         2
## 15        work         2

In this case, because the RSS feed only gave us six article summaries, our word counts are relatively low.

4.2 Textbook

Likewise, we start by converting data2 into a data frame of word counts.

# transform data into data frame and rename columns
Results2 <- data2 %>% ldply(data.frame) %>% rename("Word" = 1, "Frequency" = 2) %>% arrange(desc(Frequency))
str(Results2)
## 'data.frame':    5990 obs. of  2 variables:
##  $ Word     : chr  "r" "data" "the" "html" ...
##  $ Frequency: int  1589 1435 950 726 684 660 604 554 544 491 ...
summary(Results2)
##      Word             Frequency      
##  Length:5990        Min.   :   2.00  
##  Class :character   1st Qu.:   2.00  
##  Mode  :character   Median :   5.00  
##                     Mean   :  15.01  
##                     3rd Qu.:  12.00  
##                     Max.   :1589.00

In this case, the word counts are much higher (they came from a book, after all!), so let’s filter the words based on their frequency. We choose minimum and maximum word counts and keep only the words whose frequency falls within that range.

# set minimum and maximum word frequency to display
min <- 200 
max <- 1600 

# filter based on min/max word frequency
Results2_f <- Results2 %>% filter(Frequency > min & Frequency <= max)
Results2_f
##           Word Frequency
## 1            r      1589
## 2         data      1435
## 3          the       950
## 4         html       726
## 5         http       684
## 6  information       660
## 7           we       604
## 8     function       554
## 9          web       544
## 10         xml       491
## 11        text       436
## 12        file       419
## 13         str       401
## 14          in       391
## 15       table       389
## 16    document       366
## 17         url       366
## 18        list       339
## 19        code       305
## 20     content       299
## 21           a       292
## 22      server       290
## 23        node       275
## 24        this       275
## 25     extract       270
## 26         www       269
## 27     section       262
## 28         set       261
## 29   functions       258
## 30    scraping       254
## 31       files       250
## 32     request       245
## 33     package       228
## 34       xpath       226
## 35       names       215
## 36          to       212
## 37           c       211
## 38        page       211
## 39     element       210
## 40        form       208
## 41       title       201

We notice that there are still some non-content words that the “stop words” option didn’t exclude, so we filter them out here to arrive at the final data frame.

# filter out non-content words
excl <- c("the", "a", "an", "we", "in", "out", "this", "that", "to", "from")
Results2_f <- Results2_f %>% filter(!(Word %in% excl))
Results2_f
##           Word Frequency
## 1            r      1589
## 2         data      1435
## 3         html       726
## 4         http       684
## 5  information       660
## 6     function       554
## 7          web       544
## 8          xml       491
## 9         text       436
## 10        file       419
## 11         str       401
## 12       table       389
## 13    document       366
## 14         url       366
## 15        list       339
## 16        code       305
## 17     content       299
## 18      server       290
## 19        node       275
## 20     extract       270
## 21         www       269
## 22     section       262
## 23         set       261
## 24   functions       258
## 25    scraping       254
## 26       files       250
## 27     request       245
## 28     package       228
## 29       xpath       226
## 30       names       215
## 31           c       211
## 32        page       211
## 33     element       210
## 34        form       208
## 35       title       201

5 Exploratory Data Analysis

5.1 RSS feed

Reviewing the Results1 data frame above, we can see that certain words are associated with data science skills while others are not. We use our (subjective) judgment to assign skills to these words and show the final result.

# assign skills to words
rss_skill <- c("Data analysis", "NA", "Scientific methods", "NA", "Analytical skills", "NA", "NA",
                "NA", "Learning & curiosity", "Management skills", "NA", "NA", "NA", "NA", "NA")
rss_df <- cbind(Results1, Skill = rss_skill)
rss_df %>% filter(Skill != "NA") %>% kable(align = "rcl", caption = "Top Data Science Skills from RSS Feed")
Top Data Science Skills from RSS Feed

        Word   Frequency   Skill
        data           7   Data analysis
     science           4   Scientific methods
  analytical           2   Analytical skills
       learn           2   Learning & curiosity
  management           2   Management skills
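An alternative way to express this mapping is with an explicit word-to-skill lookup table joined onto the word counts. The sketch below repeats the same subjective labels as above; its only advantage is that it does not depend on the row order of Results1:

# hypothetical lookup table: the skill labels repeat our subjective choices above
skill_lookup <- data.frame(
  Word  = c("data", "science", "analytical", "learn", "management"),
  Skill = c("Data analysis", "Scientific methods", "Analytical skills",
            "Learning & curiosity", "Management skills")
)
# words without a matching entry in the lookup table are dropped by the inner join
Results1 %>% inner_join(skill_lookup, by = "Word")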

5.2 Textbook

We visualize the Results2_f data frame above using a bar chart, and notice that the words “r” and “data” are clear outliers. In order to show the pattern for the remaining words, we filter out these two outliers and then graph the data as a polar bar chart (a radar-style view).

# bar chart 
Results2_f %>% ggplot(aes(x = reorder(Word, Frequency), y = Frequency, fill = Word)) + 
    geom_bar(stat = "identity", show.legend = FALSE) + coord_flip() + 
    ggtitle(paste0("Word Frequency ", min, "-", max, " in ", "Textbook")) + 
    labs(x = NULL, y = "Frequency") 

# filter out outliers and show radar plot
Results2_f %>% filter(!(Word %in% c("r", "data"))) %>% 
    ggplot(aes(x = Word, y = Frequency, fill = Word)) + 
    geom_bar(stat = "identity", show.legend = FALSE) + coord_polar(theta = "x") + 
    ggtitle(paste0("Word Frequency ", min, "-", max, " in ", "Textbook")) + 
    labs(x = NULL, y = "Frequency")

As before, we use our judgment to assign skills to the words, and then display the list of most frequent words and their implied data science skills.

# assign skills to words
book_skill <- c("R programming", "Data analysis", "Web scraping", "Web scraping", "NA",
                "General programming", "Web scraping", "Web scraping", "Text manipulation", "NA",
                "Text manipulation", "Data analysis", "Web scraping", "NA", "Data analysis",
                "General programming", "NA", "NA", "Web scraping", "Text manipulation",
                "Web scraping", "NA", "NA", "General programming", "Web scraping",
                "NA", "Web scraping", "General programming", "Web scraping", "R programming",
                "General programming", "Web scraping", "General programming", "Web scraping", "Web scraping")

book_df <- cbind(Results2_f, Skill = book_skill)
book_df %>% filter(Skill != "NA") %>% kable(align = "rcl", caption = "Top Word Counts & Implied Data Science Skills from Textbook")
Top Word Counts & Implied Data Science Skills from Textbook
Word Frequency Skill
r 1589 R programming
data 1435 Data analysis
html 726 Web scraping
http 684 Web scraping
function 554 General programming
web 544 Web scraping
xml 491 Web scraping
text 436 Text manipulation
str 401 Text manipulation
table 389 Data analysis
document 366 Web scraping
list 339 Data analysis
code 305 General programming
node 275 Web scraping
extract 270 Text manipulation
www 269 Web scraping
functions 258 General programming
scraping 254 Web scraping
request 245 Web scraping
package 228 General programming
xpath 226 Web scraping
names 215 R programming
c 211 General programming
page 211 Web scraping
element 210 General programming
form 208 Web scraping
title 201 Web scraping

We can summarize the analysis by grouping the word entries by skill, and then sorting the skills by total word count.

book_df %>% filter(Skill != "NA") %>% group_by(Skill) %>% 
    summarize(Total_Count = sum(Frequency)) %>% arrange(desc(Total_Count)) %>%  
    kable(align = "rl", caption = "Top Data Science Skills from Textbook")
Top Data Science Skills from Textbook

                Skill   Total_Count
         Web scraping          4700
        Data analysis          2163
        R programming          1804
  General programming          1766
    Text manipulation          1107

6 Conclusions

6.1 Findings from our analysis

From our word frequency analysis of the RSS feed and the textbook, we found that the most frequently mentioned skills are the following:

  From RSS Feed                    From Textbook
  1. Data analysis                 1. Web scraping
  2. Scientific methods            2. Data analysis
  3 (tied). Analytical skills      3. R programming
  3 (tied). Learning & curiosity   4. General programming
  3 (tied). Management skills      5. Text manipulation

This suggests that these are the most important data science skills, based on the sources we analyzed. As mentioned in the introduction, we should highlight several caveats to the analysis:

  • Our findings are highly dependent on the sources we chose to analyze. In particular, it should come as no surprise that the top three skills from a textbook titled “Automated Data Collection with R” are web scraping, data analysis, and R programming.
  • Our analysis relies on the assumption that the most frequently discussed skills are the most highly valued. However, this may not be the case as measured by other metrics, e.g., salary information in the data science field by skillset.
  • In mapping the word frequencies to implied skills, we have used our subjective judgment. Other individuals may reasonably map these words to different implied skills.

Overall, however, this seems to be a reasonable set of skills that are highly valued in the data science field.

6.2 Lessons learned from our collaboration

We worked as a virtual team on this project. Lessons learned include the following:

  • Communication: both of us work full-time and have families, so finding time to sync up and discuss the project wasn’t trivial. We made use of Slack and GitHub to communicate and share our work, and, when needed, a phone call to discuss issues in more detail.
  • Teamwork: both of us were focused on the work at hand, and didn’t have to deal with personality issues or coordination issues sometimes found in larger groups.
  • Complementary skill sets: because we have complementary skill sets, we were able to learn from each other, and we had an efficient division of labor.

6.3 Challenges encountered and suggestions for further analysis

We encountered several challenges as we worked on the project. Some of the challenges included:

  • RSS feed: we were only able to load six articles from the Google Alert RSS feed, and the content for each article only included a truncated summary. This limited the size of the dataset that we analyzed, as well as the word counts.
  • Textbook: because of the long time required to download the textbook from the internet source, we saved the textbook in the GitHub project repository and tried accessing it there. However, there seemed to be an issue with the security protocol, which prevented us from using this approach.

Our suggestions for further analysis include:

  • RSS feed: it would be interesting to do a broader search across internet articles that include other search terms related to data science, and also to include the full article content (rather than summaries).
  • Other data sources: it would be good to use a variety of books or other content on general data science topics, rather than a single book on a specialized topic.
  • Other metrics: it would be interesting to see the results using other proxy measures. For instance, salary information on data science job postings could be analyzed by scraping a website like Indeed.com; a rough sketch of what such a scrape might look like is given below.
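To illustrate what that might involve, here is a minimal, hypothetical sketch using the rvest package. The query URL and CSS selectors are assumptions for illustration only and would need to be checked against the live site (and its terms of use); they are not part of our analysis.

# hypothetical sketch only: the query URL and CSS selectors below are assumed, not verified
library(rvest)

search_url <- "https://www.indeed.com/jobs?q=data+scientist"   # assumed query format
page <- read_html(search_url)

# pull job titles and salary snippets (selector names are guesses for illustration)
titles   <- page %>% html_nodes(".jobTitle") %>% html_text(trim = TRUE)
salaries <- page %>% html_nodes(".salary-snippet") %>% html_text(trim = TRUE)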