The main goal of the project is to determine which skills are most valued by the employer for the field of data science. For us to identify relevant answers to this question, we decided to find out through the current posting and look for those skills that were most frequently requested and required by the employers. Since a data set with the current job postings are not available, the team decided to scrape data from an online job posting site, Indeed.com to perform our analysis. There is not a strong presence of libraries regarding R and web scarping thus the web scraping was initialized by starting a bot that could bypass the Indeed 403 error. (Code within repository)

Clean Data

urlclean <- "https://raw.githubusercontent.com/Sangeetha-007/Project-3-607/main/test2.csv"
df1<-read_csv(url(urlclean))
## Rows: 100 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): keyword, location, page, position, company, jobkey, jobTitle, jobDe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df<-df1%>%
  select(-c(3:7))%>%
  rename("Job Title" = "keyword")
df
## # A tibble: 100 × 3
##    `Job Title`    location jobDescription                                       
##    <chr>          <chr>    <chr>                                                
##  1 Data Scientist New York "<div>\n <div>\n  Candidates can be based anywhere i…
##  2 Data Scientist New York "<div>\n <ul>\n  <li>Develops, validates and execute…
##  3 Data Scientist New York "<p>Description/Fundamental Components: <br> <br> De…
##  4 Data Scientist New York "<div>\n <div>\n  Company\n </div> Federal Reserve B…
##  5 Data Scientist New York "<div>\n <div>\n  <b>Introduction</b>\n  <br> A care…
##  6 Data Scientist New York "<div>\n <p>Disney Streaming Services is responsible…
##  7 Data Scientist New York "<div>\n <p><b>Job Description:</b></p>\n <p></p>\n …
##  8 Data Scientist New York "<p></p>\n<div>\n <h2 class=\"jobSectionHeader\"><p>…
##  9 Data Scientist New York "<div>\n <p>Hi, we're Oscar. We're hiring a Data Sci…
## 10 Data Scientist New York "J.P. Morgan's CIB Technology Operating team is resp…
## # … with 90 more rows

We know that within our model we want to explore the various known hard and soft skills needed for the roles

# Hard Skills
freq_hardskills<- df %>%
    mutate(R = grepl("\\bR\\b,", jobDescription)) %>%
    mutate(python = grepl("Python", jobDescription, ignore.case=TRUE)) %>%
    mutate(SQL = grepl("SQL", jobDescription, ignore.case=TRUE)) %>%
    mutate(hadoop = grepl("hadoop", jobDescription, ignore.case=TRUE)) %>%
    mutate(perl = grepl("perl", jobDescription, ignore.case=TRUE)) %>%
    mutate(matplotlib = grepl("matplotlib", jobDescription, ignore.case=TRUE)) %>%
    mutate(Cplusplus = grepl("C++", jobDescription, fixed=TRUE)) %>%
    mutate(VB = grepl("VB", jobDescription, ignore.case=TRUE)) %>%
    mutate(java = grepl("java\\b", jobDescription, ignore.case=TRUE)) %>%
    mutate(scala = grepl("scala", jobDescription, ignore.case=TRUE)) %>%
    mutate(tensorflow = grepl("tensorflow", jobDescription, ignore.case=TRUE)) %>%
    mutate(mongodb = grepl("mongodb", jobDescription, ignore.case=TRUE)) %>%
    mutate(Hive = grepl("Hive", jobDescription, ignore.case=TRUE)) %>%
    mutate(tableau = grepl("tableau", jobDescription, ignore.case=TRUE)) %>%
    mutate("Power BI" = grepl("Power BI", jobDescription, ignore.case=TRUE)) %>%
    mutate(noSQL = grepl("noSQL", jobDescription, ignore.case=TRUE)) %>%
    mutate("predictive modeling" = grepl("predictive modeling", jobDescription, ignore.case=TRUE)) %>%
    mutate(AWS = grepl("AWS", jobDescription, ignore.case=TRUE)) %>%
    mutate(Azure = grepl("Azure", jobDescription, ignore.case=TRUE)) %>%
    mutate(javascript = grepl("javascript", jobDescription, ignore.case=TRUE)) %>%
    mutate(spark = grepl("spark", jobDescription, ignore.case=TRUE)) %>%
    mutate("Machine Learning" = grepl("Machine Learning", jobDescription, ignore.case=TRUE)) %>%
    mutate(cloud = grepl("cloud", jobDescription, ignore.case=TRUE)) %>%
    mutate(masters = grepl("masters", jobDescription, ignore.case=TRUE)) %>%
    mutate(statistics = grepl("statistics", jobDescription, ignore.case=TRUE)) %>%
  
select(`Job Title`, R, python, SQL, hadoop, perl, matplotlib, Cplusplus, VB, java, "Machine Learning", scala, tensorflow, mongodb, javascript, spark, Hive, tableau, "Power BI", noSQL,"predictive modeling", AWS, Azure, cloud, masters, statistics)
freq_hardskills
## # A tibble: 100 × 26
##    Job Tit…¹ R     python SQL   hadoop perl  matpl…² Cplus…³ VB    java  Machi…⁴
##    <chr>     <lgl> <lgl>  <lgl> <lgl>  <lgl> <lgl>   <lgl>   <lgl> <lgl> <lgl>  
##  1 Data Sci… FALSE TRUE   TRUE  TRUE   FALSE FALSE   FALSE   FALSE FALSE TRUE   
##  2 Data Sci… FALSE TRUE   TRUE  FALSE  FALSE FALSE   FALSE   FALSE FALSE TRUE   
##  3 Data Sci… FALSE TRUE   TRUE  FALSE  FALSE FALSE   FALSE   FALSE FALSE TRUE   
##  4 Data Sci… FALSE FALSE  FALSE FALSE  FALSE FALSE   FALSE   FALSE FALSE TRUE   
##  5 Data Sci… TRUE  TRUE   TRUE  TRUE   FALSE FALSE   FALSE   FALSE TRUE  TRUE   
##  6 Data Sci… FALSE TRUE   TRUE  FALSE  FALSE FALSE   FALSE   FALSE FALSE TRUE   
##  7 Data Sci… FALSE TRUE   TRUE  FALSE  FALSE TRUE    FALSE   FALSE FALSE TRUE   
##  8 Data Sci… FALSE TRUE   TRUE  FALSE  FALSE FALSE   FALSE   TRUE  TRUE  TRUE   
##  9 Data Sci… FALSE TRUE   TRUE  FALSE  FALSE FALSE   FALSE   FALSE FALSE FALSE  
## 10 Data Sci… FALSE FALSE  FALSE FALSE  FALSE FALSE   FALSE   FALSE FALSE FALSE  
## # … with 90 more rows, 15 more variables: scala <lgl>, tensorflow <lgl>,
## #   mongodb <lgl>, javascript <lgl>, spark <lgl>, Hive <lgl>, tableau <lgl>,
## #   `Power BI` <lgl>, noSQL <lgl>, `predictive modeling` <lgl>, AWS <lgl>,
## #   Azure <lgl>, cloud <lgl>, masters <lgl>, statistics <lgl>, and abbreviated
## #   variable names ¹​`Job Title`, ²​matplotlib, ³​Cplusplus, ⁴​`Machine Learning`
#Soft Skills
freq_softskills <- df%>%
    mutate(remote = grepl("remote", jobDescription, ignore.case=TRUE)) %>%
    mutate(communication = grepl("communicat", jobDescription, ignore.case=TRUE)) %>%
    mutate(collaborative = grepl("collaborat", jobDescription, ignore.case=TRUE)) %>%
    mutate(creative = grepl("creativ", jobDescription, ignore.case=TRUE)) %>%
    mutate(critical = grepl("critical", jobDescription, ignore.case=TRUE)) %>%
    mutate(problemsolving = grepl("problem solving", jobDescription, ignore.case=TRUE)) %>%
    mutate(activelearning = grepl("active learning", jobDescription, ignore.case=TRUE)) %>%
    mutate(hypothesis = grepl("hypothesis", jobDescription, ignore.case=TRUE)) %>%
    mutate(organized = grepl("organize", jobDescription, ignore.case=TRUE)) %>%
    mutate(judgement = grepl("judgement", jobDescription, ignore.case=TRUE)) %>%
    mutate(selfstarter = grepl("self Starter", jobDescription, ignore.case=TRUE)) %>%
    mutate(interpersonalskills = grepl("interpersonal skills", jobDescription, ignore.case=TRUE)) %>%
    mutate(attentiontodetail = grepl("attention to detail", jobDescription, ignore.case=TRUE)) %>%
    mutate(visualization = grepl("visualization", jobDescription, ignore.case=TRUE)) %>%
    mutate(motivated = grepl("motivated", jobDescription, ignore.case=TRUE)) %>%
    mutate(independent = grepl("independent", jobDescription, ignore.case=TRUE)) %>%
    mutate(resourceful = grepl("resourceful", jobDescription, ignore.case=TRUE)) %>%
    mutate(passion = grepl("passion", jobDescription, ignore.case=TRUE)) %>%
    mutate(determination = grepl("determination", jobDescription, ignore.case=TRUE)) %>%
    mutate(focus = grepl("focus", jobDescription, ignore.case=TRUE)) %>%
    mutate(leadership = grepl("leadership", jobDescription, ignore.case=TRUE)) %>%

  select(`Job Title`, remote, communication, collaborative, creative, critical, problemsolving, 
  activelearning, hypothesis, organized, judgement, selfstarter, interpersonalskills, attentiontodetail, 
  visualization, leadership, motivated, resourceful, independent, passion, determination, focus)
freq_softskills
## # A tibble: 100 × 22
##    `Job Title`    remote commu…¹ colla…² creat…³ criti…⁴ probl…⁵ activ…⁶ hypot…⁷
##    <chr>          <lgl>  <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>  
##  1 Data Scientist FALSE  TRUE    TRUE    TRUE    TRUE    FALSE   FALSE   FALSE  
##  2 Data Scientist FALSE  TRUE    FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
##  3 Data Scientist FALSE  TRUE    FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
##  4 Data Scientist FALSE  TRUE    TRUE    FALSE   TRUE    FALSE   FALSE   FALSE  
##  5 Data Scientist FALSE  TRUE    TRUE    TRUE    FALSE   FALSE   FALSE   FALSE  
##  6 Data Scientist FALSE  TRUE    TRUE    FALSE   FALSE   FALSE   FALSE   FALSE  
##  7 Data Scientist FALSE  TRUE    TRUE    FALSE   FALSE   FALSE   FALSE   TRUE   
##  8 Data Scientist FALSE  TRUE    TRUE    FALSE   TRUE    FALSE   FALSE   FALSE  
##  9 Data Scientist TRUE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
## 10 Data Scientist FALSE  FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE  
## # … with 90 more rows, 13 more variables: organized <lgl>, judgement <lgl>,
## #   selfstarter <lgl>, interpersonalskills <lgl>, attentiontodetail <lgl>,
## #   visualization <lgl>, leadership <lgl>, motivated <lgl>, resourceful <lgl>,
## #   independent <lgl>, passion <lgl>, determination <lgl>, focus <lgl>, and
## #   abbreviated variable names ¹​communication, ²​collaborative, ³​creative,
## #   ⁴​critical, ⁵​problemsolving, ⁶​activelearning, ⁷​hypothesis

Analysis

A summary of the various Hard and Soft skills needs to be established in order for the following visualization tool to utilize the data

hardskills <- freq_hardskills %>%
  select(-1)%>%
  summarise_all(sum)%>%
  gather(skill, freq)%>%
  arrange(desc(freq))

softskills <- freq_softskills %>%
  select(-1)%>%
  summarise_all(sum)%>%
  gather(skill, freq)%>%
  arrange(desc(freq))




hardskills
## # A tibble: 25 × 2
##    skill             freq
##    <chr>            <int>
##  1 python              86
##  2 SQL                 77
##  3 Machine Learning    66
##  4 statistics          63
##  5 tableau             31
##  6 R                   30
##  7 spark               25
##  8 cloud               24
##  9 AWS                 22
## 10 scala               19
## # … with 15 more rows
softskills
## # A tibble: 21 × 2
##    skill          freq
##    <chr>         <int>
##  1 communication    72
##  2 collaborative    60
##  3 focus            40
##  4 passion          38
##  5 visualization    37
##  6 critical         28
##  7 creative         26
##  8 leadership       26
##  9 remote           21
## 10 independent      21
## # … with 11 more rows

Top 3 hard skills : Python, SQL, Machine Learning
Top 3 soft skills: Communicative, Collaborative, Focused

Visualization of Soft and Hard Skills

ggplot(hardskills,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="blue") + xlab('') + ylab('Frequency') + labs(title='Hard Skills') + coord_flip() + theme_minimal()

ggplot(softskills,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="purple") + xlab('') + ylab('Frequency') + labs(title='Soft Skills') + coord_flip() + theme_minimal()

ggplot(hardskills, aes(x=skill, y=freq))+
  geom_point()+
  theme(axis.text.x = element_text(angle = 90))

ggplot(softskills, aes(x=skill, y=freq))+
  geom_point()+
  theme(axis.text.x = element_text(angle = 90))

Conclusion

Our project demonstrates that skills in Python, SQL and Machine Learning are among the top skills required to be considered as an ideal candidate for a Data Scientist roles. This is significant because while these are the core skills, their are various other hard skills that could be utilized to set someone apart from the fellow candidates.

It is also Important to note that there are also the soft skills of being communicative, collaborative and focused that also bring an added benefit towards a candidate.

Another important thing to note is that the magnitude of hard skills outweighs the soft skills and thus can demonstrate that more importance is placed on candidates abilities to accurately draw insights from data.

Enjoy!

wordcloud2(hardskills, size = 0.7)