When you search Indeed.com, the resulting URL includes all of the criteria specified in the search. Our first approach concatenates search result URLs with different specifications. We combined each skill we were interested in with the term ‘data scientist’, specified the city of interest, and set the search radius to 50 miles. We then iterated through the URLs we created and collected the number of job postings (count) for each search.
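The URL construction described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraping code: the helper name `build_search_urls`, the base URL, and the query parameter names `q`, `l`, and `radius` are assumptions about how the search URL is laid out.

```r
# Hypothetical helper: builds one search URL per (skill, city) combination.
# The base URL and the q / l / radius parameter names are assumptions.
build_search_urls <- function(skills, cities, radius = 50,
                              base_url = "https://www.indeed.com/jobs") {
  # Percent-encode each element so spaces and commas are URL-safe
  enc <- function(x) {
    vapply(x, utils::URLencode, character(1), reserved = TRUE, USE.NAMES = FALSE)
  }
  # Every skill/city pair (the first column varies fastest)
  grid <- expand.grid(Skills = skills, City = cities, stringsAsFactors = FALSE)
  # Combine the fixed job title with each skill
  query <- paste("data scientist", grid$Skills)
  grid$url <- paste0(base_url,
                     "?q=", enc(query),
                     "&l=", enc(grid$City),
                     "&radius=", radius)
  grid
}

urls <- build_search_urls(c("Python", "R"), c("New York, NY", "Boston, MA"))
```

Each row of `urls` keeps the skill and city next to its URL, so a scraped count can later be attached to the matching row.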
For our second approach we decided to focus on one location, New York, NY. We also approached scraping Indeed slightly differently. Instead of changing the parameters of the search result URL, we kept them the same, searching only for ‘data scientist’ in New York, NY. Each search result page displays 10 results, so we created a URL for each sequential page of the search results by increasing the ‘start’ parameter by 10.
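A minimal sketch of this pagination scheme is shown below. The helper name `build_page_urls` and the exact base URL are assumptions for illustration; only the step of 10 on the ‘start’ parameter comes from the description above.

```r
# Hypothetical helper: one URL per results page, stepping `start` by 10.
# The base URL (including the q and l parameters) is an assumption.
build_page_urls <- function(n_pages, per_page = 10,
                            base_url = paste0("https://www.indeed.com/jobs",
                                              "?q=data%20scientist",
                                              "&l=New%20York%2C%20NY")) {
  # Offsets 0, 10, 20, ... (integers avoid scientific notation in the URL)
  starts <- as.integer(seq(from = 0, by = per_page, length.out = n_pages))
  paste0(base_url, "&start=", starts)
}

page_urls <- build_page_urls(3)
```

Here `page_urls` holds the URLs for the first three result pages, with `start` set to 0, 10, and 20.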
Storing data in database
After scraping, the data was first stored in a CSV file and later loaded into a MySQL database. The initial idea was to load the data into a cloud database (db4free.net), but we faced some issues with the MySQL version hosted there and ended up storing the data in a local database. We kept the flexibility of reading the data from either the local CSV file or the database.
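One way to keep both data sources available is a small loader that switches between them. The sketch below is an assumption about how this could look, not the project's own code: the function name `load_indeed_data`, the table name `indeed_skills`, and the sample values are hypothetical, and the database branch assumes an already open DBI connection.

```r
# Hypothetical loader: read the scraped counts from a CSV file or a database.
load_indeed_data <- function(source = c("csv", "db"), csv_path = NULL,
                             con = NULL, table = "indeed_skills") {
  source <- match.arg(source)
  if (source == "csv") {
    stopifnot(!is.null(csv_path), file.exists(csv_path))
    read.csv(csv_path, stringsAsFactors = FALSE)
  } else {
    # `con` is assumed to be an open DBI connection to the MySQL database,
    # for example one created with DBI::dbConnect(RMySQL::MySQL(), ...)
    stopifnot(!is.null(con))
    DBI::dbReadTable(con, table)
  }
}

# Usage with a temporary CSV file (illustrative values, not the scraped data)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Skills = c("Python", "R"),
                     City = c("New York, NY", "New York, NY"),
                     Count = c(10, 5)),
          tmp, row.names = FALSE)
scraped <- load_indeed_data("csv", csv_path = tmp)
```

Because the DBI call is only reached in the database branch, the CSV path works even on a machine where no database driver is installed.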
Storing data in database - code
Visualization and analysis
# Total number of postings per skill, summed across all cities
indeed_skillaggr <- aggregate(read_indeed_url$Count, by = list(Category = read_indeed_url$Skills), FUN = sum)
indeed_skillaggr
##               Category     x
## 1             Big Data  7684
## 2        Communication 13615
## 3          Data Mining  2262
## 4               Hadoop  3507
## 5                 Hive  1533
## 6     Machine Learning  7640
## 7            MapReduce   522
## 8              MongoDB   420
## 9       Neural network   880
## 10                 NLP   928
## 11 Predictive Analysis  1850
## 12              Python  8886
## 13                   R  4624
## 14                 SAS  5476
## 15          Tensorflow  1101
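library(RColorBrewer)
library(ggplot2)
library(wordcloud)
# Horizontal bar chart of the last 15 rows of skills_count
plots_top <- tail(skills_count, 15)
darkcols <- brewer.pal(8, "Dark2")
skill_names <- plots_top$Skills
barplot(plots_top$Total, main = "Indeed Skill Demand", horiz = TRUE, names.arg = skill_names, las = 1, col = darkcols, cex.axis = 0.5, cex.names = 0.5)
# Point plot of the first 10 rows of skills_city, coloured by city and sized by total
top10_skills <- skills_city[1:10, ]
ggplot(top10_skills, aes(x = Skills, y = Total, colour = City, size = Total)) + ggtitle("Citywise Skill Demand") + geom_point()
# Word cloud of skills weighted by total count
wordcloud(skills_count$Skills, skills_count$Total, random.order = FALSE, colors = brewer.pal(8, "Dark2"))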
# Order the key word levels by count so the bars appear sorted
ny_indeed$key_words <- factor(ny_indeed$key_words, levels = unique(ny_indeed$key_words)[order(ny_indeed$count, decreasing = FALSE)])
m <- list(
l = 100,
r = 100,
b = 100,
t = 100,
pad = 4
)
library(plotly)
key_word_plot <- plot_ly(data = ny_indeed, x = ~count, y = ~key_words, type = 'bar', orientation = 'h', color = ~type) %>%
  layout(title = 'Skills Required of Data Scientists in NY')
key_word_plot
# Order the skill types by their total count
grpd$type <- factor(grpd$type, levels = unique(grpd$type)[order(grpd$sum_by_type, decreasing = FALSE)])
sum_by_type <- plot_ly(data = grpd, x = ~sum_by_type, y = ~type, type = 'bar', orientation = 'h', color = ~type) %>%
  layout(title = 'NY Skills by Type')
sum_by_type
RPubs Location of Data Analytics file
Conclusions, lessons learned and possible enhancements
Conclusion:
Our findings show that many skills are required of a Data Scientist. We learned that some of the top hard skills required are Python, Machine Learning, Big Data, SQL, Excel, and R. As for soft skills, a Data Scientist is expected to communicate well and have managerial experience. Mathematics (or math) was also the most frequently mentioned key word across all NY Data Scientist job postings. However, from our NY data we cannot definitively conclude whether one type of skill is significantly more important than another. Although we counted more mentions of hard skills than of soft skills or education requirements, our search included more key words for hard skills than for either of the other types, so this result should be expected. Most importantly, we learned that a Data Scientist is required to be well rounded, with a strong higher education and both soft and hard skills to ensure they can get the job done.
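The caveat about unequal numbers of key words per type can be illustrated with a small calculation. The numbers below are made up purely for illustration and are not the project's results; the column names `key_words`, `type`, and `count` mirror those used for the NY data above.

```r
# Illustrative values only, NOT the scraped NY data
ny_example <- data.frame(
  key_words = c("python", "sql", "excel", "communication", "masters"),
  type      = c("hard", "hard", "hard", "soft", "education"),
  count     = c(60, 45, 30, 40, 35),
  stringsAsFactors = FALSE
)

# Raw totals favour the type with the most key words in the search list
totals <- tapply(ny_example$count, ny_example$type, sum)

# Dividing by the number of key words per type gives a fairer comparison
n_keywords  <- tapply(ny_example$count, ny_example$type, length)
per_keyword <- totals / n_keywords
```

In this toy example the hard skills total is more than three times the soft skills total, but the mentions per key word are much closer, which is why the raw totals alone do not settle which type matters most.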
The following are the lessons learned while working on this project:
Future Enhancements:
GitHub Location of Web Scraping file
GitHub Location of Data Analytics file
GitHub location of Database Transformation file
GitHub location of Final Presentation file
RPubs Location of Web Scraping file
RPubs Location of Data Analytics file