Nick’s contribution
Libraries used
library(tidyverse)    # data manipulation and ggplot2 graphics
library(tm)           # text mining: corpora, cleaning, document-term matrices
library(wordcloud)    # word cloud visualizations
library(knitr)        # kable() tables
library(kableExtra)   # styled HTML tables
library(reshape2)     # melt() for reshaping data to long format
First we imported the data into R using the readLines function, which stores the raw text as a character vector in the variable "url".
url <- readLines("https://raw.githubusercontent.com/WigodskyD/data-sets/master/SF_cleanjobfiles")
The data was then combined into one string, which makes it easier to search the text for keywords located in the dataset.
review_text <- paste(url, collapse=" ")
We then used the VectorSource function, which wraps our text as a source for the corpus, and loaded the result into a variable named corpus.
review_source <- VectorSource(review_text)
corpus <- Corpus(review_source)
Next we clean the data: the text was converted to all lower case (so no terms are missed because of capitalization), extra white space was stripped, numbers were removed, English stop words (the, a, etc.) were dropped, and we removed words we decided were not needed for our analysis. All of this uses the tm_map function from the tm (text mining) package.
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case everything
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra whitespace
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop common stop words
# Custom list: generic job-posting filler plus HTML scraping residue
# ("brbr", "div", "span", "classresultlinkbarcontainer", ...)
corpus <- tm_map(corpus, removeWords, c("andor", "status well", "within",
                 "business", "learning", "field",
                 "span", "amp", "across", "strong", "large", "using",
                 "help", "clients", "class", "classresultlinkbarcontainer",
                 "every", "work", "can", "position", "risk", "global", "work",
                 "will", "brbr", "resultlinkbarviewjob", "years",
                 "idjobsummary", "div", "new", "skills"))
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
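Because all postings were collapsed into a single string, the corpus contains just one document, so the document-term matrix has a single row and colSums simply reads the counts off that row. A quick sanity check (assuming the objects above):

dim(dtm2)  # 1 row (the one combined document) x one column per distinct term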
We then created a frequency table to determine the most-used words in the cleaned dataset.
The data shows that, across the entire document, the following words were used the most:
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=T)
table <- head(frequency, 20)
kable(table)
word | frequency |
---|---|
data | 2779 |
experience | 1311 |
team | 760 |
machine | 700 |
summary | 657 |
result | 653 |
bar | 645 |
link | 644 |
science | 607 |
job | 509 |
research | 459 |
analysis | 445 |
product | 444 |
analytics | 390 |
models | 351 |
rsquo | 340 |
engineering | 338 |
build | 336 |
ability | 332 |
company | 330 |
words <- names(frequency)
wordcloud(words[1:100], frequency[1:100],
colors=brewer.pal(8, "Dark2"))
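Note that wordcloud places words at random, so the layout changes from run to run; setting a seed first (a small addition, not in the original code) makes the figure reproducible:

set.seed(1234)  # any fixed seed; makes the random word placement repeatable
wordcloud(words[1:100], frequency[1:100],
          colors=brewer.pal(8, "Dark2"))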
wf <- data.frame(word=names(frequency), frequency=frequency)
p <- ggplot(subset(wf, frequency>200), aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity") +
theme(axis.text.x=element_text(angle=45, hjust=1))
p
From the full Indeed listings for San Francisco, it seems that the keywords data, experience, and team are the top three terms in job listings.
Let's filter our data some more, looking only at the specific keywords that we identified from our Glassdoor article as mentioned above.
The following code restricts the tm package to just the terms we specify (our Data Scientist skills):
review_text <- paste(url, collapse=" ")
review_source <- VectorSource(review_text)
corpus <- Corpus(review_source)
keep = c("python", "sql", "hadoop", "statistics", "tableau", "Java", "Machine",
"geo", "jupyter notebook", "sas", "scala", "matlab", "big data", "database", "data mining",
"collaboration", "unsupervised", "visulization", "sra", "svm", "hdfs", "linux", "map reduce",
"pig", "decision forests", "phd","masters", "bachelors", "r", "c")
We then run code similar to the earlier steps in order to clean our data:
# keepOnlyWords replaces everything that is NOT a whole-word match from
# `words` with a space, leaving only the keywords of interest.
keepOnlyWords <- content_transformer(function(x, words) {
  regmatches(x,
             gregexpr(paste0("(\\b", paste(words, collapse = "\\b|\\b"), "\\b)"), x),
             invert = TRUE) <- " "
  x
})
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, keepOnlyWords, keep)
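As a quick illustration of what keepOnlyWords does (a toy example, not part of the analysis):

# Everything that is not a whole-word match from the keep list
# is replaced with a space, so only the keywords survive.
sample_text <- tolower("Experience with Python, SQL and Hadoop preferred")
regmatches(sample_text,
           gregexpr("(\\bpython\\b|\\bsql\\b|\\bhadoop\\b)", sample_text),
           invert = TRUE) <- " "
sample_text
## [1] " python sql hadoop "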
We then create a frequency table showing the newly filtered data. Note that multi-word entries in the keep list (e.g. "big data", "map reduce") are split into single tokens by the tokenizer, which is why big, mining, map, reduce, jupyter, and notebook appear as separate terms below:
# wordLengths = c(1, Inf) keeps one-letter terms like "r" and "c"
# (tm's default minimum term length is 3)
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=T)
kable(frequency)
word | frequency |
---|---|
python | 292 |
statistics | 252 |
r | 201 |
data | 200 |
sql | 183 |
phd | 121 |
big | 112 |
mining | 88 |
hadoop | 87 |
c | 81 |
scala | 58 |
collaboration | 49 |
database | 41 |
sas | 41 |
tableau | 34 |
masters | 31 |
matlab | 30 |
linux | 21 |
pig | 18 |
unsupervised | 15 |
bachelors | 8 |
hdfs | 6 |
svm | 6 |
map | 4 |
reduce | 4 |
decision | 2 |
forests | 2 |
jupyter | 2 |
notebook | 2 |
geo | 1 |
wf <- data.frame(word=names(frequency), frequency=frequency)
dataset1 <- wf
p2 <- ggplot(subset(wf, frequency>50), aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", aes(fill= reorder(word, -frequency))) +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(legend.position="none")
p2
To see which skills tend to group together, we cluster the terms, first hierarchically and then with k-means:
library(fpc)
library(cluster)
# Hierarchical clustering of terms: Euclidean distance on the transposed DTM
d <- dist(t(dtm), method="euclidean")
fit <- hclust(d=d, method="complete")
plot.new()
plot(fit, hang=-1)                    # dendrogram of term clusters
groups <- cutree(fit, k=6)            # cut the tree into 6 groups
rect.hclust(fit, k=6, border="red")   # outline the 6 clusters on the plot
# k-means clustering on the same distance matrix, with k = 4
d <- dist(t(dtm), method="euclidean")
kfit <- kmeans(d, 4)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
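The choice of k = 4 is not motivated in the text; an elbow plot (a sketch using the same distance matrix d, not part of the original analysis) is one way to sanity-check it:

# Total within-cluster sum of squares for a range of k; look for the "elbow"
set.seed(42)  # kmeans uses random starts, so fix a seed for repeatability
wss <- sapply(2:10, function(k) kmeans(d, centers=k, nstart=10)$tot.withinss)
plot(2:10, wss, type="b", xlab="Number of clusters k",
     ylab="Total within-cluster SS")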
Looking at the filtered data (tables and visualizations) gives us, as future Data Scientists, a deeper understanding of some of the top skills that are needed. It seems that Data Scientists need to know several programming languages (Python, R, SQL). Data Scientists also need to understand statistics. This is also shown in our clustering diagrams: analysis, python, statistics, data, and sql are the most closely related when compared to all of the other skills.
The above data looks at skills found on the West Coast of the United States.
Below we're going to take a look at the skills needed on the East Coast, and later compare the two.
The approach is the same as above regarding the code, so to save space only the needed outputs are shown.
The data is taken from White Plains, New York:
url <- readLines("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-607/master/cleanjobfiles.txt")
word | frequency |
---|---|
data | 2174 |
experience | 1316 |
research | 837 |
summary | 651 |
bar | 607 |
result | 607 |
link | 603 |
team | 601 |
analysis | 489 |
job | 472 |
development | 460 |
analytics | 452 |
science | 427 |
ability | 402 |
including | 363 |
management | 352 |
models | 345 |
rsquo | 342 |
machine | 332 |
regeneron | 318 |
We then create a word cloud for this dataset.
Looking at this data for the East Coast, it seems the top keywords in the entire dataset are again data and experience.
Let's filter this data by the same skills used for our previous dataset.
word | frequency |
---|---|
python | 185 |
r | 178 |
statistics | 176 |
data | 161 |
sql | 136 |
big | 101 |
phd | 95 |
sas | 81 |
database | 79 |
hadoop | 77 |
mining | 60 |
collaboration | 58 |
c | 57 |
scala | 44 |
tableau | 38 |
matlab | 26 |
masters | 24 |
linux | 23 |
pig | 14 |
unsupervised | 8 |
svm | 7 |
bachelors | 4 |
hdfs | 3 |
geo | 2 |
map | 2 |
reduce | 2 |
jupyter | 1 |
notebook | 1 |
sra | 1 |
Looking at the filtered data, it seems that Data Scientists on the East Coast also need to know several programming languages (Python and R are the most mentioned) and need to understand statistics.
This is also shown in our clustering diagrams: analysis, python, statistics, and data are the most closely related when compared to all of the other skills. (Notice how SQL is less related in White Plains, NY than in San Francisco, CA.)
Now that we have both datasets, let's see how they compare to each other:
Comparison of the two datasets, White Plains, NY and San Francisco, CA, by frequency of skills:
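The merge below uses dataset2, the White Plains frequency table. Its construction was part of the East Coast code hidden above; presumably it mirrors how dataset1 was created:

# Presumed construction, mirroring dataset1 (the East Coast code is hidden above)
dataset2 <- data.frame(word=names(frequency), frequency=frequency)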
compare_data <- merge(dataset1, dataset2, by = "word")
compare_data1 <- compare_data %>%
  rename(SF = frequency.x, WP = frequency.y) %>%
  mutate(Difference = abs(SF - WP))
compare_data1[1:25, ] %>%
mutate_if(is.numeric, function(x) {
cell_spec(x, "html", bold = T, color = spec_color(x, end = 0.9),
font_size = spec_font_size(x))
}) %>%
mutate(word = cell_spec(
word, "html", color = "white", bold = T,
background = spec_color(1:10, end = 0.9, option = "A", direction = -1)
)) %>%
kable("html", escape = F, align = "c") %>%
kable_styling("striped", full_width = T)
word | SF | WP | Difference |
---|---|---|---|
bachelors | 8 | 4 | 4 |
big | 112 | 101 | 11 |
c | 81 | 57 | 24 |
collaboration | 49 | 58 | 9 |
data | 200 | 161 | 39 |
database | 41 | 79 | 38 |
geo | 1 | 2 | 1 |
hadoop | 87 | 77 | 10 |
hdfs | 6 | 3 | 3 |
jupyter | 2 | 1 | 1 |
linux | 21 | 23 | 2 |
map | 4 | 2 | 2 |
masters | 31 | 24 | 7 |
matlab | 30 | 26 | 4 |
mining | 88 | 60 | 28 |
notebook | 2 | 1 | 1 |
phd | 121 | 95 | 26 |
pig | 18 | 14 | 4 |
python | 292 | 185 | 107 |
r | 201 | 178 | 23 |
reduce | 4 | 2 | 2 |
sas | 41 | 81 | 40 |
scala | 58 | 44 | 14 |
sql | 183 | 136 | 47 |
statistics | 252 | 176 | 76 |
The following code reshapes the data into long format using the melt function from the reshape2 package, and then compares the San Francisco and White Plains datasets (west vs. east coast) by count of skills:
library(reshape2)
long <- melt(compare_data, id=c("word"))
ggplot(long) +
geom_bar(aes(x = word, y = value, fill = variable),
stat="identity", position = "dodge", width = 0.7) +
scale_fill_manual("Result\n", values = c("red","blue"),
labels = c("SF", " WP")) +
labs(x="\nWord",y="Result\n") +
theme_bw(base_size = 12) +
coord_flip()
Comparison of San Francisco and White Plains.
Conclusions from comparison:
It seems that the data from San Francisco, CA has more keywords that relate to technical skills - SQL (183 vs. 136), Python (292 vs. 185), R (201 vs. 178) - and more often asks for a candidate with an advanced degree: PhD (121 vs. 95), masters (31 vs. 24).
Looking at the entire dataset, it seems that both areas want Data Scientists with experience, data, and analysis skills.