The main goal of this project is to determine which skills employers value most in the field of data science. To answer this question, we examined current job postings and looked for the skills that employers requested or required most frequently. Since a data set of current job postings was not available, the team decided to scrape data from an online job posting site, Indeed.com, for our analysis. We found few R libraries suited to web scraping, so we began the scraping by setting up a bot that could get past Indeed's 403 error. (Code within repository)
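# Packages used throughout: tidyverse (readr, dplyr, tidyr, ggplot2) and wordcloud2
library(tidyverse)
library(wordcloud2)
urlclean <- "https://raw.githubusercontent.com/Sangeetha-007/Project-3-607/main/test2.csv"
df1 <- read_csv(url(urlclean))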
## Rows: 100 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): keyword, location, page, position, company, jobkey, jobTitle, jobDe...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Drop columns 3 to 7, keeping keyword, location and jobDescription, then rename keyword
df <- df1 %>%
  select(-c(3:7)) %>%
  rename("Job Title" = "keyword")
df
## # A tibble: 100 × 3
## `Job Title` location jobDescription
## <chr> <chr> <chr>
## 1 Data Scientist New York "<div>\n <div>\n Candidates can be based anywhere i…
## 2 Data Scientist New York "<div>\n <ul>\n <li>Develops, validates and execute…
## 3 Data Scientist New York "<p>Description/Fundamental Components: <br> <br> De…
## 4 Data Scientist New York "<div>\n <div>\n Company\n </div> Federal Reserve B…
## 5 Data Scientist New York "<div>\n <div>\n <b>Introduction</b>\n <br> A care…
## 6 Data Scientist New York "<div>\n <p>Disney Streaming Services is responsible…
## 7 Data Scientist New York "<div>\n <p><b>Job Description:</b></p>\n <p></p>\n …
## 8 Data Scientist New York "<p></p>\n<div>\n <h2 class=\"jobSectionHeader\"><p>…
## 9 Data Scientist New York "<div>\n <p>Hi, we're Oscar. We're hiring a Data Sci…
## 10 Data Scientist New York "J.P. Morgan's CIB Technology Operating team is resp…
## # … with 90 more rows
# Hard Skills
# One logical column per skill: TRUE when the description mentions the keyword
freq_hardskills <- df %>%
  # Note: this pattern only matches a standalone "R" immediately followed by a comma (e.g. "R, Python")
  mutate(R = grepl("\\bR\\b,", jobDescription)) %>%
  mutate(python = grepl("Python", jobDescription, ignore.case=TRUE)) %>%
  mutate(SQL = grepl("SQL", jobDescription, ignore.case=TRUE)) %>%
  mutate(hadoop = grepl("hadoop", jobDescription, ignore.case=TRUE)) %>%
  mutate(perl = grepl("perl", jobDescription, ignore.case=TRUE)) %>%
  mutate(matplotlib = grepl("matplotlib", jobDescription, ignore.case=TRUE)) %>%
  # fixed=TRUE treats "C++" as a literal string rather than a regular expression
  mutate(Cplusplus = grepl("C++", jobDescription, fixed=TRUE)) %>%
  mutate(VB = grepl("VB", jobDescription, ignore.case=TRUE)) %>%
  # The trailing \\b keeps "java" from matching "javascript"
  mutate(java = grepl("java\\b", jobDescription, ignore.case=TRUE)) %>%
  mutate(scala = grepl("scala", jobDescription, ignore.case=TRUE)) %>%
  mutate(tensorflow = grepl("tensorflow", jobDescription, ignore.case=TRUE)) %>%
  mutate(mongodb = grepl("mongodb", jobDescription, ignore.case=TRUE)) %>%
  mutate(Hive = grepl("Hive", jobDescription, ignore.case=TRUE)) %>%
  mutate(tableau = grepl("tableau", jobDescription, ignore.case=TRUE)) %>%
  mutate("Power BI" = grepl("Power BI", jobDescription, ignore.case=TRUE)) %>%
  mutate(noSQL = grepl("noSQL", jobDescription, ignore.case=TRUE)) %>%
  mutate("predictive modeling" = grepl("predictive modeling", jobDescription, ignore.case=TRUE)) %>%
  mutate(AWS = grepl("AWS", jobDescription, ignore.case=TRUE)) %>%
  mutate(Azure = grepl("Azure", jobDescription, ignore.case=TRUE)) %>%
  mutate(javascript = grepl("javascript", jobDescription, ignore.case=TRUE)) %>%
  mutate(spark = grepl("spark", jobDescription, ignore.case=TRUE)) %>%
  mutate("Machine Learning" = grepl("Machine Learning", jobDescription, ignore.case=TRUE)) %>%
  mutate(cloud = grepl("cloud", jobDescription, ignore.case=TRUE)) %>%
  mutate(masters = grepl("masters", jobDescription, ignore.case=TRUE)) %>%
  mutate(statistics = grepl("statistics", jobDescription, ignore.case=TRUE)) %>%
  select(`Job Title`, R, python, SQL, hadoop, perl, matplotlib, Cplusplus, VB, java, "Machine Learning",
         scala, tensorflow, mongodb, javascript, spark, Hive, tableau, "Power BI", noSQL,
         "predictive modeling", AWS, Azure, cloud, masters, statistics)
freq_hardskills
## # A tibble: 100 × 26
## Job Tit…¹ R python SQL hadoop perl matpl…² Cplus…³ VB java Machi…⁴
## <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 Data Sci… FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## 2 Data Sci… FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 3 Data Sci… FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 4 Data Sci… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 5 Data Sci… TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
## 6 Data Sci… FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 7 Data Sci… FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
## 8 Data Sci… FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
## 9 Data Sci… FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 10 Data Sci… FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## # … with 90 more rows, 15 more variables: scala <lgl>, tensorflow <lgl>,
## # mongodb <lgl>, javascript <lgl>, spark <lgl>, Hive <lgl>, tableau <lgl>,
## # `Power BI` <lgl>, noSQL <lgl>, `predictive modeling` <lgl>, AWS <lgl>,
## # Azure <lgl>, cloud <lgl>, masters <lgl>, statistics <lgl>, and abbreviated
## # variable names ¹`Job Title`, ²matplotlib, ³Cplusplus, ⁴`Machine Learning`
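One caveat with plain substring patterns like the ones used for the hard skills: a short keyword can also match inside a longer, unrelated word. The sketch below uses made-up example sentences (not taken from the scraped data) to show how adding `\\b` word boundaries changes what `grepl()` reports. The helper `has_word` is a hypothetical name introduced only for this illustration.

```r
# Made-up example descriptions (assumptions for illustration, not real postings)
descriptions <- c(
  "Build scalable pipelines and document them properly.",
  "Experience with Scala and Perl is a plus."
)

# Plain substring matching: "scala" also hits "scalable", "perl" also hits "properly"
substring_scala <- grepl("scala", descriptions, ignore.case = TRUE)
substring_perl  <- grepl("perl",  descriptions, ignore.case = TRUE)

# Hypothetical helper: match the keyword only as a whole word
has_word <- function(keyword, text) {
  grepl(paste0("\\b", keyword, "\\b"), text, ignore.case = TRUE)
}

word_scala <- has_word("scala", descriptions)
word_perl  <- has_word("perl",  descriptions)

# Side-by-side view of both matching strategies
data.frame(substring_scala, word_scala, substring_perl, word_perl)
```

With substring matching both example sentences are flagged, while the whole-word version flags only the second one. Applying the same idea to the real data would change the reported counts, so the figures in this report should be read with that in mind.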
# Soft Skills
freq_softskills <- df %>%
  mutate(remote = grepl("remote", jobDescription, ignore.case=TRUE)) %>%
  mutate(communication = grepl("communicat", jobDescription, ignore.case=TRUE)) %>%
  mutate(collaborative = grepl("collaborat", jobDescription, ignore.case=TRUE)) %>%
  mutate(creative = grepl("creativ", jobDescription, ignore.case=TRUE)) %>%
  mutate(critical = grepl("critical", jobDescription, ignore.case=TRUE)) %>%
  mutate(problemsolving = grepl("problem solving", jobDescription, ignore.case=TRUE)) %>%
  mutate(activelearning = grepl("active learning", jobDescription, ignore.case=TRUE)) %>%
  mutate(hypothesis = grepl("hypothesis", jobDescription, ignore.case=TRUE)) %>%
  mutate(organized = grepl("organize", jobDescription, ignore.case=TRUE)) %>%
  mutate(judgement = grepl("judgement", jobDescription, ignore.case=TRUE)) %>%
  mutate(selfstarter = grepl("self Starter", jobDescription, ignore.case=TRUE)) %>%
  mutate(interpersonalskills = grepl("interpersonal skills", jobDescription, ignore.case=TRUE)) %>%
  mutate(attentiontodetail = grepl("attention to detail", jobDescription, ignore.case=TRUE)) %>%
  mutate(visualization = grepl("visualization", jobDescription, ignore.case=TRUE)) %>%
  mutate(motivated = grepl("motivated", jobDescription, ignore.case=TRUE)) %>%
  mutate(independent = grepl("independent", jobDescription, ignore.case=TRUE)) %>%
  mutate(resourceful = grepl("resourceful", jobDescription, ignore.case=TRUE)) %>%
  mutate(passion = grepl("passion", jobDescription, ignore.case=TRUE)) %>%
  mutate(determination = grepl("determination", jobDescription, ignore.case=TRUE)) %>%
  mutate(focus = grepl("focus", jobDescription, ignore.case=TRUE)) %>%
  mutate(leadership = grepl("leadership", jobDescription, ignore.case=TRUE)) %>%
  select(`Job Title`, remote, communication, collaborative, creative, critical, problemsolving,
         activelearning, hypothesis, organized, judgement, selfstarter, interpersonalskills, attentiontodetail,
         visualization, leadership, motivated, resourceful, independent, passion, determination, focus)
freq_softskills
## # A tibble: 100 × 22
## `Job Title` remote commu…¹ colla…² creat…³ criti…⁴ probl…⁵ activ…⁶ hypot…⁷
## <chr> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 Data Scientist FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## 2 Data Scientist FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 Data Scientist FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 Data Scientist FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
## 5 Data Scientist FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## 6 Data Scientist FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
## 7 Data Scientist FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
## 8 Data Scientist FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
## 9 Data Scientist TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 10 Data Scientist FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## # … with 90 more rows, 13 more variables: organized <lgl>, judgement <lgl>,
## # selfstarter <lgl>, interpersonalskills <lgl>, attentiontodetail <lgl>,
## # visualization <lgl>, leadership <lgl>, motivated <lgl>, resourceful <lgl>,
## # independent <lgl>, passion <lgl>, determination <lgl>, focus <lgl>, and
## # abbreviated variable names ¹communication, ²collaborative, ³creative,
## # ⁴critical, ⁵problemsolving, ⁶activelearning, ⁷hypothesis
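# Count the postings that mention each hard skill, sorted from most to least frequent
hardskills <- freq_hardskills %>%
  select(-1) %>%
  summarise_all(sum) %>%
  gather(skill, freq) %>%
  arrange(desc(freq))
# Same tally for the soft skills
softskills <- freq_softskills %>%
  select(-1) %>%
  summarise_all(sum) %>%
  gather(skill, freq) %>%
  arrange(desc(freq))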
hardskills
## # A tibble: 25 × 2
## skill freq
## <chr> <int>
## 1 python 86
## 2 SQL 77
## 3 Machine Learning 66
## 4 statistics 63
## 5 tableau 31
## 6 R 30
## 7 spark 25
## 8 cloud 24
## 9 AWS 22
## 10 scala 19
## # … with 15 more rows
softskills
## # A tibble: 21 × 2
## skill freq
## <chr> <int>
## 1 communication 72
## 2 collaborative 60
## 3 focus 40
## 4 passion 38
## 5 visualization 37
## 6 critical 28
## 7 creative 26
## 8 leadership 26
## 9 remote 21
## 10 independent 21
## # … with 11 more rows
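The tallying step relies on the fact that summing a logical column counts its `TRUE` values, so each total is the number of postings that mention the skill. A minimal base R sketch on a tiny made-up table (the `flags` data frame and its values are assumptions for illustration only):

```r
# Tiny made-up table of skill flags, one row per posting (illustrative values)
flags <- data.frame(
  python = c(TRUE, TRUE, FALSE, TRUE),
  SQL    = c(TRUE, FALSE, FALSE, TRUE),
  perl   = c(FALSE, FALSE, FALSE, FALSE)
)

# Summing a logical column counts its TRUE values,
# i.e. the number of postings that mention the skill
counts <- colSums(flags)

# Long format (one row per skill), sorted from most to least frequent
tally <- data.frame(skill = names(counts), freq = as.integer(counts))
tally <- tally[order(-tally$freq), ]
tally
```

This mirrors the shape of the `hardskills` and `softskills` tables: one row per skill with its frequency, in descending order.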
Top 3 hard skills: Python, SQL, Machine Learning
Top 3 soft skills: Communicative, Collaborative, Focused
ggplot(hardskills, aes(x = reorder(skill, freq), y = freq)) +
  geom_bar(stat = 'identity', fill = "blue") +
  xlab('') +
  ylab('Frequency') +
  labs(title = 'Hard Skills') +
  coord_flip() +
  theme_minimal()
ggplot(softskills, aes(x = reorder(skill, freq), y = freq)) +
  geom_bar(stat = 'identity', fill = "purple") +
  xlab('') +
  ylab('Frequency') +
  labs(title = 'Soft Skills') +
  coord_flip() +
  theme_minimal()
ggplot(hardskills, aes(x = skill, y = freq)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90))
ggplot(softskills, aes(x = skill, y = freq)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90))
Our project demonstrates that Python, SQL and Machine Learning are among the top skills required to be considered an ideal candidate for a Data Scientist role. This is significant because, while these are the core skills, there are various other hard skills that could set someone apart from fellow candidates.
It is also important to note that the soft skills of being communicative, collaborative and focused bring an added benefit to a candidate.
Another important point is that the most common hard skills appear in more postings than the most common soft skills, which suggests that more importance is placed on a candidate's ability to accurately draw insights from data.
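To make the hard versus soft comparison concrete, the reported top-10 counts can be put side by side. This sketch simply retypes the frequencies printed in the `hardskills` and `softskills` tables above (out of 100 postings). The vector names are ours, and the skills beyond the top 10 are left out because their counts are not shown.

```r
# Top-10 frequencies retyped from the hardskills and softskills tables (100 postings)
hard_top10 <- c(python = 86, SQL = 77, `Machine Learning` = 66, statistics = 63,
                tableau = 31, R = 30, spark = 25, cloud = 24, AWS = 22, scala = 19)
soft_top10 <- c(communication = 72, collaborative = 60, focus = 40, passion = 38,
                visualization = 37, critical = 28, creative = 26, leadership = 26,
                remote = 21, independent = 21)

n_postings <- 100

# Share of postings mentioning the single most common skill in each group
top_share <- c(hard = max(hard_top10), soft = max(soft_top10)) / n_postings

# Average number of mentions across each top-10 list
mean_mentions <- c(hard = mean(hard_top10), soft = mean(soft_top10))

top_share
mean_mentions
```

On these numbers the hard-skill list leads on both measures, although keyword counts are only a rough proxy for how much weight employers actually place on each skill.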
Enjoy!
wordcloud2(hardskills, size = 0.7)