In this project, we try to answer this question by examining which words describing data science skills appear most often in several data science sources. In essence, we will use word frequency as a proxy measure for the value of data science skills: the higher the word frequency, the more valued the skill. This approach relies on three key assumptions:

- the sources we choose are representative of the data science field;
- word frequency is a reasonable proxy for how highly a skill is valued;
- individual words can be mapped to specific data science skills.
These assumptions should be highlighted, since we know that they are not entirely correct. First, the sources we choose may bias or skew our results; for instance, a book on programming will tend to contain many words relating to technical skills and analytics rather than "soft skills" such as teamwork and communication. Second, there are many possible proxy measures besides word count, such as salary information on job postings, and these alternatives may well give different results. Third, the mapping from individual words to specific data science skills is subjective and open to interpretation, and a single word may describe different skills depending on the context.
So to address the main question of our project, we start by answering a more specific question:
“Which words relating to data science skills occur most frequently in a couple of representative data science sources?”
Once we have the answer to this specific question, we will see how that can be applied to answer the general question.
We worked as a two-person team over the course of a week.
Communication: We generally touched base daily by telephone or Slack, in order to update each other on progress made, challenges, and next steps.
Collaboration tools: We shared R code, text files, and an R Markdown file using a project repository set up on GitHub, at https://github.com/aschwenker/project3.
Once we decided on the overall approach, we did some preliminary research online to see what tools and methods would be most effective for text mining and word frequency analysis. We reviewed several approaches, and found that in order to implement our approach, we would need to use several packages. The packages included:
- tidyRSS package: to download an RSS feed and extract the data
- readr package: to download a file from the internet
- pdftools package: to extract text from a PDF file
- tm package: to tokenize a character vector
- tau package: to count word patterns from character vectors

# load packages
library(tidyRSS)
library(readr)
library(pdftools)
library(tm)
library(tau)
library(plyr)
library(dplyr)
library(knitr)
library(ggplot2)
We decided to use a couple of sources relating to data science for the word frequency analysis:
RSS feed: First, we set up a Google Alert RSS feed using the search terms “data+science+skills”. This runs a Google search for articles on the internet containing the search terms and summarizes the results in an RSS feed, which we could access at: https://www.google.com/alerts/feeds/00182648300908928214/18036739630504927351
Textbook: Second, we decided to use the PDF version of the textbook “Automated Data Collection with R”, which we found online at: http://kek.ksu.ru/eos/WM/AutDataCollectR.pdf
Our logic was that the first source would represent stories in the general media about data science skills, while the second source would give a more academic or technical perspective.
In order to access the RSS feed, we used the tidyRSS::tidyfeed function to read in the data.
# read in google alert RSS feed
data_science_skills <- tidyfeed("https://www.google.com/alerts/feeds/00182648300908928214/18036739630504927351", sf = TRUE)
names(data_science_skills)
## [1] "feed_title" "feed_link" "feed_last_updated"
## [4] "item_title" "item_date_updated" "item_link"
## [7] "item_content"
data_science_skills$item_link
## [1] "tag:google.com,2013:googlealerts/feed:17356835139618477013"
## [2] "tag:google.com,2013:googlealerts/feed:2436643227745363368"
## [3] "tag:google.com,2013:googlealerts/feed:6627674387286412333"
## [4] "tag:google.com,2013:googlealerts/feed:877245457118443212"
## [5] "tag:google.com,2013:googlealerts/feed:8818965960477480591"
## [6] "tag:google.com,2013:googlealerts/feed:18031683177660808970"
data_science_skills$item_content
## [1] "To remain a relevant sector in the Fourth Industrial Revolution (FIRe), persons with disabilities (PWDs) must equip themselves with <b>science</b> and ..."
## [2] "Developers want to learn the <b>data sciences</b>. They see machine learning and <b>data</b> <b>science</b> as the most important <b>skill</b> they need to learn in the year ..."
## [3] "Our graduate diploma in <b>data science</b> is designed for graduates wishing to broaden their <b>skill</b>-set enabling, them to become competent and confident ..."
## [4] "New or upgraded <b>skills</b> are needed in many key areas of work, especially those developing digital capabilities, the use of <b>data</b> and the <b>science</b> of <b>data</b>, ..."
## [5] "We have an opportunity for you to use your <b>analytical skills</b> to improve the Department of Defense's management of multiple <b>data</b> sources. You'll work ..."
## [6] "We have an opportunity for you to use your leadership and <b>analytical skills</b> to improve the Department of Defense's management of multiple <b>data</b> ..."
# extract the story summaries into a character vector
title_vector <- c(t(data_science_skills$item_content))
title_vector
## [1] "To remain a relevant sector in the Fourth Industrial Revolution (FIRe), persons with disabilities (PWDs) must equip themselves with <b>science</b> and ..."
## [2] "Developers want to learn the <b>data sciences</b>. They see machine learning and <b>data</b> <b>science</b> as the most important <b>skill</b> they need to learn in the year ..."
## [3] "Our graduate diploma in <b>data science</b> is designed for graduates wishing to broaden their <b>skill</b>-set enabling, them to become competent and confident ..."
## [4] "New or upgraded <b>skills</b> are needed in many key areas of work, especially those developing digital capabilities, the use of <b>data</b> and the <b>science</b> of <b>data</b>, ..."
## [5] "We have an opportunity for you to use your <b>analytical skills</b> to improve the Department of Defense's management of multiple <b>data</b> sources. You'll work ..."
## [6] "We have an opportunity for you to use your leadership and <b>analytical skills</b> to improve the Department of Defense's management of multiple <b>data</b> ..."
Note that the content of the RSS feed only includes story summaries (truncated after approximately 150-180 characters), so this will limit our word counts.
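As a quick check on this truncation, we could inspect the length of each summary directly (a minimal sketch using the title_vector built above):
# character count of each story summary
summary(nchar(title_vector))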
Now that we have a collection of text data, we can use the tm and tau packages to do the word count. We adapted code we found online (https://www.codementor.io/jhwatts2010/counting-words-with-r-ds35hzgmj) for this part of the analysis. In counting word frequencies, we want to focus on “content” words that convey meaning, rather than function words relating to syntax and grammar, such as pronouns and prepositions. To help with this, tm offers a common set of “stop words” that we can exclude from the word count.
# flag: exclude stop words from the word count
stop_words <- TRUE
# show a random sample of the stop words to exclude from the word count
sort(sample(tm::stopwords("SMART"), 100))
## [1] "anyhow" "apart" "appropriate" "at"
## [5] "because" "before" "behind" "besides"
## [9] "between" "beyond" "both" "c'mon"
## [13] "causes" "certainly" "clearly" "co"
## [17] "com" "comes" "consequently" "during"
## [21] "edu" "eg" "etc" "except"
## [25] "following" "forth" "go" "goes"
## [29] "gotten" "h" "happens" "has"
## [33] "hence" "himself" "his" "i"
## [37] "i'd" "i'll" "inasmuch" "inc"
## [41] "inward" "is" "isn't" "it's"
## [45] "its" "keep" "kept" "lately"
## [49] "later" "latter" "least" "liked"
## [53] "likely" "many" "more" "much"
## [57] "nd" "need" "needs" "new"
## [61] "no" "nobody" "none" "novel"
## [65] "nowhere" "outside" "p" "quite"
## [69] "rather" "reasonably" "relatively" "respectively"
## [73] "said" "says" "secondly" "seeing"
## [77] "soon" "t" "t's" "tends"
## [81] "than" "thank" "thats" "the"
## [85] "thereby" "theres" "though" "truly"
## [89] "up" "upon" "was" "went"
## [93] "weren't" "what" "whatever" "where"
## [97] "with" "yes" "you've" "z"
Next we use the tm::scan_tokenizer and tau::textcnt functions to build up the word count, after excluding the stop words.
# tokenize the text
data <- tm::scan_tokenizer(title_vector)
# remove stop words
if (stop_words) data <- tm::removeWords(data, tm::stopwords("SMART"))
# count words
data1 <- tau::textcnt(data, method = "string", n = 1L, lower = 1L)
str(data1)
## 'textcnt' Named int [1:15] 2 7 2 2 2 2 2 2 6 2 ...
## - attr(*, "names")= chr [1:15] "analytical" "data" "defense" "department" ...
## - attr(*, "useBytes")= logi FALSE
summary(data1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 2.0 2.0 2.8 2.5 7.0
For the textbook, we first downloaded the PDF file from the internet and saved it locally in order to do the analysis. Given the size of the book and the time required to download it, we’ve commented out the download.file command in the code below. We also tried saving the PDF file in GitHub and downloading it from there, but we encountered some difficulties, probably relating to the security protocol.
# download PDF of book from open website
# download.file("http://kek.ksu.ru/eos/WM/AutDataCollectR.pdf", "./AutDataCollectR.pdf")
text <- pdf_text("./AutDataCollectR.pdf")
# OR try loading PDF from GitHub
# but this doesn't seem to work
# url <- "https://github.com/aschwenker/project3/blob/master/AutDataCollectR.pdf"
# text <- pdf_text(getURL(url))
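One likely explanation is that the GitHub /blob/ URL returns the HTML page for the file rather than the raw PDF. If the PDF were committed to the repository, downloading via the raw URL would typically work; the path below is an assumption and is untested:
# untested sketch: raw.githubusercontent.com serves the file contents directly,
# assuming the PDF is committed at the root of the master branch
# url <- "https://raw.githubusercontent.com/aschwenker/project3/master/AutDataCollectR.pdf"
# download.file(url, "./AutDataCollectR.pdf", mode = "wb")
# text <- pdf_text("./AutDataCollectR.pdf")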
Next we apply the same procedure as above to tokenize the data and build up the word count, after excluding the stop words.
# tokenize the data
data <- tm::scan_tokenizer(text)
# remove stop words
if (stop_words) data <- tm::removeWords(data, tm::stopwords("SMART"))
# count words
data2 <- tau::textcnt(data, method = "string", n = 1L, lower = 1L)
str(data2)
## 'textcnt' Named int [1:5990] 2 23 2 4 3 3 3 4 4 3 ...
## - attr(*, "names")= chr [1:5990] "*" "-" "<U+2192>" "<U+22EE>" ...
## - attr(*, "useBytes")= logi FALSE
summary(data2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.00 5.00 15.01 12.00 1589.00
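Since the tokenize / stop-word / count steps are identical for both sources, they could also be wrapped in a small helper function (a sketch; we keep the explicit steps in this report for clarity):
# helper: tokenize a character vector, optionally remove stop words, and count word frequencies
count_words <- function(x, remove_stop_words = TRUE) {
  tokens <- tm::scan_tokenizer(x)
  if (remove_stop_words) {
    tokens <- tm::removeWords(tokens, tm::stopwords("SMART"))
  }
  tau::textcnt(tokens, method = "string", n = 1L, lower = 1L)
}
# usage: data1 <- count_words(title_vector); data2 <- count_words(text)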
Upon reviewing the data, we notice that the result of the word count processing is a textcnt data type, not a data frame. We use the ldply function to transform data1 into a single data frame of word counts, and then rename the columns.
# transform data into data frame and rename columns
Results1 <- data1 %>% ldply(data.frame) %>% rename("Word" = 1, "Frequency" = 2) %>% arrange(desc(Frequency))
Results1
## Word Frequency
## 1 data 7
## 2 nbsp 6
## 3 science 4
## 4 skills 3
## 5 analytical 2
## 6 defense 2
## 7 department 2
## 8 improve 2
## 9 learn 2
## 10 management 2
## 11 multiple 2
## 12 opportunity 2
## 13 skill 2
## 14 we 2
## 15 work 2
In this case, because the RSS feed only gave us six short article summaries, our word counts are relatively low.
Likewise, we start by converting data2 into a data frame of word counts.
# transform data into data frame and rename columns
Results2 <- data2 %>% ldply(data.frame) %>% rename("Word" = 1, "Frequency" = 2) %>% arrange(desc(Frequency))
str(Results2)
## 'data.frame': 5990 obs. of 2 variables:
## $ Word : chr "r" "data" "the" "html" ...
## $ Frequency: int 1589 1435 950 726 684 660 604 554 544 491 ...
summary(Results2)
## Word Frequency
## Length:5990 Min. : 2.00
## Class :character 1st Qu.: 2.00
## Mode :character Median : 5.00
## Mean : 15.01
## 3rd Qu.: 12.00
## Max. :1589.00
In this case, the word counts are much higher (they came from a book, after all!), so let’s filter the words based on their frequency. We choose a minimum and maximum frequency and keep only the words that fall within that range.
# set minimum and maximum word frequency to display
min <- 200
max <- 1600
# filter based on min/max word frequency
Results2_f <- Results2 %>% filter(Frequency > min & Frequency <= max)
Results2_f
## Word Frequency
## 1 r 1589
## 2 data 1435
## 3 the 950
## 4 html 726
## 5 http 684
## 6 information 660
## 7 we 604
## 8 function 554
## 9 web 544
## 10 xml 491
## 11 text 436
## 12 file 419
## 13 str 401
## 14 in 391
## 15 table 389
## 16 document 366
## 17 url 366
## 18 list 339
## 19 code 305
## 20 content 299
## 21 a 292
## 22 server 290
## 23 node 275
## 24 this 275
## 25 extract 270
## 26 www 269
## 27 section 262
## 28 set 261
## 29 functions 258
## 30 scraping 254
## 31 files 250
## 32 request 245
## 33 package 228
## 34 xpath 226
## 35 names 215
## 36 to 212
## 37 c 211
## 38 page 211
## 39 element 210
## 40 form 208
## 41 title 201
We notice that there are still some non-content words that the “stop words” option didn’t exclude, so we filter them out here to arrive at the final data frame.
# filter out non-content words
excl <- c("the", "a", "an", "we", "in", "out", "this", "that", "to", "from")
Results2_f <- Results2_f %>% filter(!(Word %in% excl))
Results2_f
## Word Frequency
## 1 r 1589
## 2 data 1435
## 3 html 726
## 4 http 684
## 5 information 660
## 6 function 554
## 7 web 544
## 8 xml 491
## 9 text 436
## 10 file 419
## 11 str 401
## 12 table 389
## 13 document 366
## 14 url 366
## 15 list 339
## 16 code 305
## 17 content 299
## 18 server 290
## 19 node 275
## 20 extract 270
## 21 www 269
## 22 section 262
## 23 set 261
## 24 functions 258
## 25 scraping 254
## 26 files 250
## 27 request 245
## 28 package 228
## 29 xpath 226
## 30 names 215
## 31 c 211
## 32 page 211
## 33 element 210
## 34 form 208
## 35 title 201
Reviewing the Results1 data frame above, we can see that certain words are associated with data science skills while others are not. We use our (subjective) judgment to assign skills to these words and show the final result.
# assign skills to words (one entry per row of Results1, in order; "NA" marks words not mapped to a skill)
rss_skill <- c("Data analysis", "NA", "Scientific methods", "NA", "Analytical skills", "NA", "NA",
"NA", "Learning & curiosity", "Management skills", "NA", "NA", "NA", "NA", "NA")
rss_df <- cbind(Results1, Skill = rss_skill)
rss_df %>% filter(Skill != "NA") %>% kable(align = "rcl", caption = "Top Data Science Skills from RSS Feed")
| Word | Frequency | Skill |
|---|---|---|
| data | 7 | Data analysis |
| science | 4 | Scientific methods |
| analytical | 2 | Analytical skills |
| learn | 2 | Learning & curiosity |
| management | 2 | Management skills |
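An alternative to the positional skill vector would be a word-to-skill lookup table joined on Word, which does not depend on the row order of the frequency table (a sketch that reuses the same RSS assignments as above):
# order-independent mapping: join the word counts to a lookup table of skills
skill_map <- data.frame(
  Word  = c("data", "science", "analytical", "learn", "management"),
  Skill = c("Data analysis", "Scientific methods", "Analytical skills",
            "Learning & curiosity", "Management skills"),
  stringsAsFactors = FALSE
)
Results1 %>% inner_join(skill_map, by = "Word") %>%
  kable(align = "rcl", caption = "Top Data Science Skills from RSS Feed (via lookup table)")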
We visualize the Results2_f data frame above using a bar chart, and notice that the words “R” and “data” are clear outliers. In order to illustrate the trend for the remaining words, we filter out the outliers and then graph the data as a radar plot.
# bar chart
Results2_f %>% ggplot(aes(x = reorder(Word, Frequency), y = Frequency, fill = Word)) +
geom_bar(stat = "identity", show.legend = FALSE) + coord_flip() +
ggtitle(paste0("Word Frequency ", min, "-", max, " in ", "Textbook")) +
labs(x = NULL, y = "Frequency")
# filter out outliers and show radar plot
Results2_f %>% filter(!(Word %in% c("r", "data"))) %>%
ggplot(aes(x = Word, y = Frequency, fill = Word)) +
geom_bar(stat = "identity", show.legend = FALSE) + coord_polar(theta = "x") +
ggtitle(paste0("Word Frequency ", min, "-", max, " in ", "Textbook")) +
labs(x = NULL, y = "Frequency")
As before, we use our judgment to assign skills to the words, and then display the list of most frequent words and their implied data science skills.
# assign skills to words (one entry per row of Results2_f, in order; "NA" marks words not mapped to a skill)
book_skill <- c("R programming", "Data analysis", "Web scraping", "Web scraping", "NA",
                "General programming", "Web scraping", "Web scraping", "Text manipulation", "NA",
                "Text manipulation", "Data analysis", "Web scraping", "NA", "Data analysis",
                "General programming", "NA", "NA", "Web scraping", "Text manipulation",
                "Web scraping", "NA", "NA", "General programming", "Web scraping",
                "NA", "Web scraping", "General programming", "Web scraping", "R programming",
                "General programming", "Web scraping", "General programming", "Web scraping", "Web scraping")
book_df <- cbind(Results2_f, Skill = book_skill)
book_df %>% filter(Skill != "NA") %>% kable(align = "rcl", caption = "Top Word Counts & Implied Data Science Skills from Textbook")
| Word | Frequency | Skill |
|---|---|---|
| r | 1589 | R programming |
| data | 1435 | Data analysis |
| html | 726 | Web scraping |
| http | 684 | Web scraping |
| function | 554 | General programming |
| web | 544 | Web scraping |
| xml | 491 | Web scraping |
| text | 436 | Text manipulation |
| str | 401 | Text manipulation |
| table | 389 | Data analysis |
| document | 366 | Web scraping |
| list | 339 | Data analysis |
| code | 305 | General programming |
| node | 275 | Web scraping |
| extract | 270 | Text manipulation |
| www | 269 | Web scraping |
| functions | 258 | General programming |
| scraping | 254 | Web scraping |
| request | 245 | Web scraping |
| package | 228 | General programming |
| xpath | 226 | Web scraping |
| names | 215 | R programming |
| c | 211 | General programming |
| page | 211 | Web scraping |
| element | 210 | General programming |
| form | 208 | Web scraping |
| title | 201 | Web scraping |
We can summarize the analysis by grouping the word entries by skill, and then sorting the skills by total word count.
book_df %>% filter(Skill != "NA") %>% group_by(Skill) %>%
summarize(Total_Count = sum(Frequency)) %>% arrange(desc(Total_Count)) %>%
kable(align = "rl", caption = "Top Data Science Skills from Textbook")
| Skill | Total_Count |
|---|---|
| Web scraping | 4700 |
| Data analysis | 2163 |
| R programming | 1804 |
| General programming | 1766 |
| Text manipulation | 1107 |
From our word frequency analysis of the RSS feed and the textbook, we found that the most frequently mentioned skills are the following:
| From RSS Feed | From Textbook |
|---|---|
| 1. Data analysis | 1. Web scraping |
| 2. Scientific methods | 2. Data analysis |
| 3 (tied). Analytical skills | 3. R programming |
| 3 (tied). Learning & curiosity | 4. General programming |
| 3 (tied). Management skills | 5. Text manipulation |
This suggests that these are the most important data science skills, based on the sources we analyzed. As mentioned in the introduction, we should highlight several caveats to the analysis:

- Our choice of sources (a Google Alert RSS feed and a single programming textbook) skews the results toward particular kinds of skills.
- Word frequency is only one possible proxy for the value of a skill; other proxy measures, such as salary information on job postings, could give different results.
- The mapping from words to skills is subjective, and a single word may describe different skills depending on the context.
Overall, however, this seems to be a legitimate set of skills that are highly valued in the data science field.
We worked as a virtual team on this project. Lessons learned include the following:
We encountered several challenges as we worked on the project. Some of the challenges included:
Our suggestions for further analysis include: