Loading libraries:

Introduction:

The goal of this project is to find and use data to answer the question “Which are the most valued data science. Our group members are: Carol Campbell, Kossi Akplaka, Souleymane Doumbia and Saloua Daouki.

We began by searching various websites for suitable data before finally settling on a Kaggle dataset by Luke Barousse, entitled Data Analyst Skill Analysis. This file was then uploaded into our Github Project 3 repository as “Data_Science_job.csv.”

Loading data from Github repository:

url3 <- 'https://raw.githubusercontent.com/Group-Project-Data-607/Project3/main/Data_Science_job.csv'
df <- read.csv(file = url3)

tibble (df)

## # A tibble: 1,820 × 43
##        X index title      company_name location Job.posting.location description
##    <int> <int> <chr>      <chr>        <chr>    <chr>                <chr>      
##  1     0     0 Enterpris… Children’s … Kansas … via AARP Job Board   "Thanks fo…
##  2     1     1 Senior Co… Harnham      United … via WJHL Jobs        "Senior Co…
##  3     2     2 Business … Lockton      Kansas … via Lockton - Talen… "We are lo…
##  4     3     3 PowerBI D… Mitchell Ma… United … via Mitchell Martin… "Our clien…
##  5     4     4 Sr SaaS B… Cyber Resou… United … via Central Illinoi… "Job Title…
##  6     5     5 Sr. Syste… General Dyn… Anywhere via Clearance Jobs   "REQ#: RQ1…
##  7     6     6 Sr. Data … CRB          Kansas … via Smart Recruiter… "Company D…
##  8     7     7 Senior An… Ada          United … via Simplify Jobs    "Ada was f…
##  9     8     8 Data De-I… PwC          United … via Mendeley         "• *Specia…
## 10     9     9 Data Anal… Adtalem Glo… Anywhere via Adtalem Global … "Opportuni…
## # ℹ 1,810 more rows
## # ℹ 36 more variables: extensions <chr>, job_id <chr>, thumbnail <chr>,
## #   posted_at <chr>, schedule_type <chr>, work_from_home <lgl>, salary <chr>,
## #   search_term <chr>, date_time <chr>, search_location <chr>,
## #   commute_time <lgl>, salary_pay <chr>, salary_rate <chr>, salary_avg <dbl>,
## #   salary_min <dbl>, salary_max <dbl>, salary_hourly <dbl>,
## #   salary_yearly <dbl>, salary_standardized <dbl>, Required.skills <chr>, …

Tidying the data:

This dataset contained many elements that we deemed necessary for our analysis - a variety of job listings from various companies from all over country, each listing desired skills for the ideal candidate. While ideal from a data entry standpoint, a major challenge that we encountered was that the data was far from tidy. There were multiple skills listed in one variable, vague job titles that did not match the job description, and way too many variables for the task at hand. Thus let the tidying begin.

Tidying the title column

The data included job descriptions that were vague and made it difficult to ascertain the true nature of the job posting. Therefore, we took the liberty of creating job titles based on their respective descriptions, and decided to group the jobs as “Data Scientist”, “Data Engineer”, “Data Analyst”.

For this following job title, I’ll look for this following keywords in the description column:

data science: Machine learning, modeling, A/B testing
data engineer: ETL, pipelines, warehouse, architect
data analyst: All other

If a job description contains both “data analyst” and “data science” keywords, the code will categorize it based on the first condition that matches.

skills <- df %>%
  mutate(new_title = case_when(
    grepl("ETL|pipeline|warehouse|architect", description, ignore.case = TRUE) ~ "Data Engineer",
    grepl("machine learning | modeling|A/B testing", description, ignore.case = TRUE) ~ "Data Science",
    TRUE ~ "Data Analyst"
  )) 

tibble (skills)

## # A tibble: 1,820 × 44
##        X index title      company_name location Job.posting.location description
##    <int> <int> <chr>      <chr>        <chr>    <chr>                <chr>      
##  1     0     0 Enterpris… Children’s … Kansas … via AARP Job Board   "Thanks fo…
##  2     1     1 Senior Co… Harnham      United … via WJHL Jobs        "Senior Co…
##  3     2     2 Business … Lockton      Kansas … via Lockton - Talen… "We are lo…
##  4     3     3 PowerBI D… Mitchell Ma… United … via Mitchell Martin… "Our clien…
##  5     4     4 Sr SaaS B… Cyber Resou… United … via Central Illinoi… "Job Title…
##  6     5     5 Sr. Syste… General Dyn… Anywhere via Clearance Jobs   "REQ#: RQ1…
##  7     6     6 Sr. Data … CRB          Kansas … via Smart Recruiter… "Company D…
##  8     7     7 Senior An… Ada          United … via Simplify Jobs    "Ada was f…
##  9     8     8 Data De-I… PwC          United … via Mendeley         "• *Specia…
## 10     9     9 Data Anal… Adtalem Glo… Anywhere via Adtalem Global … "Opportuni…
## # ℹ 1,810 more rows
## # ℹ 37 more variables: extensions <chr>, job_id <chr>, thumbnail <chr>,
## #   posted_at <chr>, schedule_type <chr>, work_from_home <lgl>, salary <chr>,
## #   search_term <chr>, date_time <chr>, search_location <chr>,
## #   commute_time <lgl>, salary_pay <chr>, salary_rate <chr>, salary_avg <dbl>,
## #   salary_min <dbl>, salary_max <dbl>, salary_hourly <dbl>,
## #   salary_yearly <dbl>, salary_standardized <dbl>, Required.skills <chr>, …

Here we parsed out only the columns necessary for our analysis, reducing 43 columns/variables to 8 columns/variables.

required_skills <- subset(skills, select = c('new_title', 'title', 'company_name', 'location', 'Job.posting.location', 'description', 'schedule_type', 'Required.skills'))

#rename required_skills column to skills
colnames(required_skills)[8] ="skills"

tibble (required_skills)

## # A tibble: 1,820 × 8
##    new_title     title    company_name location Job.posting.location description
##    <chr>         <chr>    <chr>        <chr>    <chr>                <chr>      
##  1 Data Analyst  Enterpr… Children’s … Kansas … via AARP Job Board   "Thanks fo…
##  2 Data Science  Senior … Harnham      United … via WJHL Jobs        "Senior Co…
##  3 Data Science  Busines… Lockton      Kansas … via Lockton - Talen… "We are lo…
##  4 Data Analyst  PowerBI… Mitchell Ma… United … via Mitchell Martin… "Our clien…
##  5 Data Engineer Sr SaaS… Cyber Resou… United … via Central Illinoi… "Job Title…
##  6 Data Engineer Sr. Sys… General Dyn… Anywhere via Clearance Jobs   "REQ#: RQ1…
##  7 Data Engineer Sr. Dat… CRB          Kansas … via Smart Recruiter… "Company D…
##  8 Data Engineer Senior … Ada          United … via Simplify Jobs    "Ada was f…
##  9 Data Engineer Data De… PwC          United … via Mendeley         "• *Specia…
## 10 Data Analyst  Data An… Adtalem Glo… Anywhere via Adtalem Global … "Opportuni…
## # ℹ 1,810 more rows
## # ℹ 2 more variables: schedule_type <chr>, skills <chr>

Remove the special characters in from the ‘skills’ column for downstream analysis

#this code taken from: https://rdrr.io/github/YangWu1227/citizenr/src/R/clearning-helpers.R

#function to remove special characters across dataframe

rm_spl_char <- function(df, var) {
  if (is.data.frame(df) == FALSE) {
    stop("'df' must be a data frame", call. = FALSE)
  }
  if (!is_character(var) | vec_size(var) > length(df)) {
    stop("'var' must be a character vector no greater than length(df)", call. = FALSE)
  }
  if (!all(var %in% names(df))) {
    stop("'var' must be columns found in 'df'", call. = FALSE)
  }
}

# remove special characters from skills column
required_skills[8] <- required_skills[8] %>%
    mutate(
      across(
        .cols = everything(),
        .fns = ~ str_replace_all(.x, "[^[:alnum:]^.]", " ")
      )
    )


tibble(required_skills[8])

## # A tibble: 1,820 × 1
##    skills                                                                       
##    <chr>                                                                        
##  1 "  "                                                                         
##  2 "  sql    excel    python  "                                                 
##  3 "  r    sql    azure    python  "                                            
##  4 "  power bi  "                                                               
##  5 "  sql    jira  "                                                            
##  6 "  python    tableau    r    sql    excel  "                                 
##  7 "  power bi    ssis    python    tableau    dax    powershell    r    git   …
##  8 "  go    spark    c    sql  "                                                
##  9 "  python    excel  "                                                        
## 10 "  spss    tableau    excel  "                                               
## # ℹ 1,810 more rows

Data Jobs

We were able to determine that of the three types of jobs in this dataset, Data Analyst was the most sought after with 1012 postings, followed by Data Engineer with 481 postings, then Data Science with 327 postings.

## Count of the different title
table(required_skills$new_title)

## 
##  Data Analyst Data Engineer  Data Science 
##          1012           481           327

Hardskills

There are a number of technological and data visualization skills that employers consider essential for data scientist candidates. Also known as “hard kills”, these requirements include competence in the latest software and statistical analysis tools, strong mathematical, quantitative and analytical skills. Here, our next task was filter and group by like software skills in preparation for analysis.

freq_hardskills<- required_skills %>%
    mutate(R = grepl("r|ggplot", skills, ignore.case=TRUE)) %>%
    mutate(python = grepl("Python|pandas|numpy|matplotlib", skills, ignore.case=TRUE)) %>%
    mutate(SQL = grepl("SQL|postgresql|mysql|nosql", skills, ignore.case=TRUE)) %>%
    mutate(Microsoft_Suite = grepl("excel|word|power point|spreadsheet", skills, ignore.case=TRUE)) %>%
    mutate(Azure = grepl("azur", skills, ignore.case=TRUE)) %>%
    mutate(Power_bi = grepl("power bi|power_bi", skills, ignore.case=TRUE)) %>%
    mutate(Tableau = grepl("tableau", skills, ignore.case=TRUE)) %>%
    mutate(Matlab = grepl("matlab", skills, ignore.case=TRUE)) %>%
    mutate(Java = grepl("java|javascript", skills, ignore.case=TRUE)) %>%
    mutate(gcp = grepl("gcp", skills, ignore.case=TRUE)) %>%
    mutate(cplusplus = grepl("c", skills, ignore.case=TRUE)) %>%
    mutate(Jupyter = grepl("jupyter", skills, ignore.case=TRUE)) %>%
    mutate(Visio = grepl("visio", skills, ignore.case=TRUE)) %>%
    mutate(Git = grepl("git", skills, ignore.case=TRUE)) %>%
    mutate(Machine_learning = grepl("machine learning|hadoop", description, ignore.case=TRUE))%>%
    mutate(Cloud = grepl("aws|snowflake", description, ignore.case=TRUE))%>%
    mutate(Other_skills = grepl("spss|sas|ssis|redshift|crystal|looker|airflow|spark|go|linus|unix|rust|alteryx|microstrategy", skills, ignore.case=TRUE)) %>%
  select(new_title, R, python, SQL, Microsoft_Suite, Azure, Power_bi, Tableau, Matlab, Java, gcp, cplusplus,Jupyter,Visio, Git, Machine_learning, Cloud, Other_skills)

While “hard skills” are important to prospective employers, strong “soft skills” are also necessary for optimal career performance. Also known as “transferable skills”, soft skills include the ability to effectively communicate across and within their respective teams, with vendors and other stakeholders. Leadership, time management and problem solving are other soft skills that are often sought after. Lastly, one must also be able to present findings in non-technical terms, easily understood by a universal audience.

Soft skills analysis

freq_softskills<- required_skills %>%
    mutate(Communication = grepl("communicat", description, ignore.case=TRUE)) %>%
    mutate(Collaborative = grepl("collaborat|group|team", description, ignore.case=TRUE)) %>%
    mutate(TimeManagement = grepl("manage", description, ignore.case=TRUE)) %>%
    mutate(Leadership = grepl("leader", description, ignore.case=TRUE)) %>%
    mutate(Remote = grepl("remote", description, ignore.case=TRUE)) %>%
    mutate(Experienced = grepl("experience", description, ignore.case=TRUE)) %>%
    mutate(Visualization = grepl("visualization", description, ignore.case=TRUE)) %>%
    mutate(Problem_solving = grepl("problem solving", description, ignore.case=TRUE)) %>%      
    mutate(Critical_analysis = grepl("critical", description, ignore.case=TRUE)) %>%
    mutate(Confidence = grepl("confiden", description, ignore.case=TRUE)) %>%
  select(new_title, Communication, Collaborative, TimeManagement, Leadership, Remote, Experienced, Visualization, Problem_solving, Critical_analysis, Confidence)

Data Jobs Analysis:

Frequency analysis of hard skills

hardskills <- freq_hardskills %>%
  select(-1)%>%
  summarise_all(sum)%>%
  gather(skill, freq)%>%
  arrange(desc(freq))

tibble(hardskills)

## # A tibble: 17 × 2
##    skill             freq
##    <chr>            <int>
##  1 R                 1021
##  2 SQL               1014
##  3 cplusplus          831
##  4 Microsoft_Suite    746
##  5 python             594
##  6 Other_skills       584
##  7 Tableau            528
##  8 Power_bi           510
##  9 Cloud              340
## 10 Machine_learning   202
## 11 Azure              108
## 12 Java                78
## 13 Git                 29
## 14 Visio               28
## 15 gcp                 15
## 16 Jupyter             12
## 17 Matlab               9

Frequency analysis for other skills

softskills <- freq_softskills %>%
  select(-1)%>%
  summarise_all(sum)%>%
  gather(skill, freq)%>%
  arrange(desc(freq))

tibble(softskills)

## # A tibble: 10 × 2
##    skill              freq
##    <chr>             <int>
##  1 Experienced        1533
##  2 Collaborative      1339
##  3 Communication      1151
##  4 TimeManagement     1047
##  5 Leadership          695
##  6 Remote              656
##  7 Visualization       601
##  8 Critical_analysis   502
##  9 Problem_solving     181
## 10 Confidence          139

ggplot(hardskills,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="blue") + xlab('') + ylab('Frequency') + labs(title='Valued Hard Skills - Data Jobs') + coord_flip()

ggplot(softskills,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="orange") + xlab('') + ylab('Frequency') + labs(title='Valued Soft Skills - Data Jobs') + coord_flip()

Data Science Jobs

Here we look at the hard skills required for Data Science jobs

freq_hardskills_science<- required_skills %>%
    filter(new_title == 'Data Science') %>%  
    mutate(R = grepl("r|ggplot", skills, ignore.case=TRUE)) %>%
    mutate(python = grepl("Python|pandas|numpy|matplotlib", skills, ignore.case=TRUE)) %>%
    mutate(SQL = grepl("SQL|postgresql|mysql|nosql", skills, ignore.case=TRUE)) %>%
    mutate(Microsoft_Suite = grepl("excel|word|power point|spreadsheet|outlook|sharepoint|vba", skills, ignore.case=TRUE)) %>%
    mutate(Azure = grepl("azur", skills, ignore.case=TRUE)) %>%
    mutate(Power_bi = grepl("power bi|power_bi", skills, ignore.case=TRUE)) %>%
    mutate(Tableau = grepl("tableau", skills, ignore.case=TRUE)) %>%
    mutate(Matlab = grepl("matlab", skills, ignore.case=TRUE)) %>%
    mutate(Java = grepl("java|javascript", skills, ignore.case=TRUE)) %>%
    mutate(gcp = grepl("gcp", skills, ignore.case=TRUE)) %>%
    mutate(cplusplus = grepl("c", skills, ignore.case=TRUE)) %>%
    mutate(Jupyter = grepl("jupyter", skills, ignore.case=TRUE)) %>%
    mutate(Visio = grepl("visio", skills, ignore.case=TRUE)) %>%
    mutate(Git = grepl("git", skills, ignore.case=TRUE)) %>%
    mutate(Machine_learning = grepl("machine learning|hadoop", description, ignore.case=TRUE))%>%
    mutate(Cloud = grepl("aws|snowflake", description, ignore.case=TRUE))%>%
    mutate(Other_skills = grepl("spss|sas|ssis|redshift|crystal|looker|airflow|spark|go|linus|unix|rust|alteryx|microstrategy", skills, ignore.case=TRUE)) %>%
  select(new_title, R, python, SQL, Microsoft_Suite, Azure, Power_bi, Tableau, Matlab, Java, gcp, cplusplus,Jupyter,Visio, Git, Machine_learning, Cloud, Other_skills)

Desired soft skills for Data Science jobs

freq_softskills_science<- required_skills %>%
    filter(new_title == 'Data Science') %>%  
    mutate(Communication = grepl("communicat", description, ignore.case=TRUE)) %>%
    mutate(Collaborative = grepl("collaborat|group|team", description, ignore.case=TRUE)) %>%
    mutate(TimeManagement = grepl("manage", description, ignore.case=TRUE)) %>%
    mutate(Leadership = grepl("leader", description, ignore.case=TRUE)) %>%
    mutate(Remote = grepl("remote", description, ignore.case=TRUE)) %>%
    mutate(Experienced = grepl("experience", description, ignore.case=TRUE)) %>%
    mutate(Visualization = grepl("visualization", description, ignore.case=TRUE)) %>%
    mutate(Problem_solving = grepl("problem solving", description, ignore.case=TRUE)) %>%      
    mutate(Critical_analysis = grepl("critical", description, ignore.case=TRUE)) %>%
    mutate(Confidence = grepl("confiden", description, ignore.case=TRUE)) %>%
  select(new_title, Communication, Collaborative, TimeManagement, Leadership, Remote, Experienced, Visualization, Problem_solving, Critical_analysis, Confidence)

Data Science Jobs Analysis:

hardskills_science <- freq_hardskills_science %>%
  select(-1)%>%
  summarise_all(sum)%>%
  gather(skill, freq)%>%
  arrange(desc(freq))

tibble(hardskills_science)

## # A tibble: 17 × 2
##    skill             freq
##    <chr>            <int>
##  1 R                  247
##  2 SQL                242
##  3 cplusplus          205
##  4 python             201
##  5 Other_skills       192
##  6 Microsoft_Suite    188
##  7 Power_bi           177
##  8 Tableau            140
##  9 Machine_learning   109
## 10 Cloud               80
## 11 Azure                8
## 12 Java                 6
## 13 Visio                5
## 14 Git                  5
## 15 Matlab               2
## 16 gcp                  0
## 17 Jupyter              0

ggplot(hardskills_science,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="red") + xlab('') + ylab('Frequency') + labs(title='Valued Hard Skills - Data Science') + coord_flip()

softskills_science <- freq_softskills_science %>%
  select(-1)%>%
  summarise_all(sum)%>%
  gather(skill, freq)%>%
  arrange(desc(freq))

tibble(softskills_science)

## # A tibble: 10 × 2
##    skill              freq
##    <chr>             <int>
##  1 Experienced         318
##  2 Collaborative       298
##  3 Communication       252
##  4 Visualization       212
##  5 Leadership          210
##  6 Critical_analysis   168
##  7 TimeManagement      153
##  8 Remote               85
##  9 Problem_solving      82
## 10 Confidence           16

ggplot(softskills_science,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="lightblue") + xlab('') + ylab('Frequency') + labs(title='Valued Soft Skills - Data Science ') + coord_flip()

Data Engineering Jobs

freq_hardskills_engineer<- required_skills %>%
    filter(new_title == 'Data Engineer') %>%  
    mutate(R = grepl("r|ggplot", skills, ignore.case=TRUE)) %>%
    mutate(python = grepl("Python|pandas|numpy|matplotlib", skills, ignore.case=TRUE)) %>%
    mutate(SQL = grepl("SQL|postgresql|mysql|nosql", skills, ignore.case=TRUE)) %>%
    mutate(Microsoft_Suite = grepl("excel|word|power point|spreadsheet|outlook|sharepoint|vba", skills, ignore.case=TRUE)) %>%
    mutate(Azure = grepl("azur", skills, ignore.case=TRUE)) %>%
    mutate(Power_bi = grepl("power bi|power_bi", skills, ignore.case=TRUE)) %>%
    mutate(Tableau = grepl("tableau", skills, ignore.case=TRUE)) %>%
    mutate(Matlab = grepl("matlab", skills, ignore.case=TRUE)) %>%
    mutate(Java = grepl("java|javascript", skills, ignore.case=TRUE)) %>%
    mutate(gcp = grepl("gcp", skills, ignore.case=TRUE)) %>%
    mutate(cplusplus = grepl("c", skills, ignore.case=TRUE)) %>%
    mutate(Jupyter = grepl("jupyter", skills, ignore.case=TRUE)) %>%
    mutate(Visio = grepl("visio", skills, ignore.case=TRUE)) %>%
    mutate(Git = grepl("git", skills, ignore.case=TRUE)) %>%
    mutate(Machine_learning = grepl("machine learning|hadoop", description, ignore.case=TRUE))%>%
    mutate(Cloud = grepl("aws|snowflake", description, ignore.case=TRUE))%>%
    mutate(Other_skills = grepl("spss|sas|ssis|redshift|crystal|looker|airflow|spark|go|linus|unix|rust|alteryx|microstrategy", skills, ignore.case=TRUE)) %>%
  select(new_title, R, python, SQL, Microsoft_Suite, Azure, Power_bi, Tableau, Matlab, Java, gcp, cplusplus,Jupyter,Visio, Git, Machine_learning, Cloud, Other_skills)

freq_softskills_engineer<- required_skills %>%
    filter(new_title == 'Data Engineer') %>%  
    mutate(Communication = grepl("communicat", description, ignore.case=TRUE)) %>%
    mutate(Collaborative = grepl("collaborat|group|team", description, ignore.case=TRUE)) %>%
    mutate(TimeManagement = grepl("manage", description, ignore.case=TRUE)) %>%
    mutate(Leadership = grepl("leader", description, ignore.case=TRUE)) %>%
    mutate(Remote = grepl("remote", description, ignore.case=TRUE)) %>%
    mutate(Experienced = grepl("experience", description, ignore.case=TRUE)) %>%
    mutate(Visualization = grepl("visualization", description, ignore.case=TRUE)) %>%
    mutate(Problem_solving = grepl("problem solving", description, ignore.case=TRUE)) %>%      
    mutate(Critical_analysis = grepl("critical", description, ignore.case=TRUE)) %>%
    mutate(Confidence = grepl("confiden", description, ignore.case=TRUE)) %>%
  select(new_title, Communication, Collaborative, TimeManagement, Leadership, Remote, Experienced, Visualization, Problem_solving, Critical_analysis, Confidence)

Data Engineering Jobs Analysis:

hardskills_engineer <- freq_hardskills_engineer %>%
  select(-1)%>%
  summarise_all(sum)%>%
  gather(skill, freq)%>%
  arrange(desc(freq))

tibble (hardskills_engineer)

## # A tibble: 17 × 2
##    skill             freq
##    <chr>            <int>
##  1 SQL                385
##  2 R                  343
##  3 python             228
##  4 cplusplus          222
##  5 Tableau            183
##  6 Other_skills       183
##  7 Microsoft_Suite    178
##  8 Cloud              160
##  9 Power_bi           147
## 10 Azure               85
## 11 Machine_learning    65
## 12 Java                51
## 13 Git                 17
## 14 gcp                 12
## 15 Visio               12
## 16 Jupyter              4
## 17 Matlab               3

ggplot(hardskills_engineer,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="chartreuse1") + xlab('') + ylab('Frequency') + labs(title='Valued Hard Skills - Data Engineering') + coord_flip()

softskills_engineer <- freq_softskills_engineer %>%
  select(-1)%>%
  summarise_all(sum)%>%
  gather(skill, freq)%>%
  arrange(desc(freq))

tibble(softskills_engineer)

## # A tibble: 10 × 2
##    skill              freq
##    <chr>             <int>
##  1 Experienced         472
##  2 Collaborative       419
##  3 Communication       355
##  4 TimeManagement      321
##  5 Remote              239
##  6 Leadership          212
##  7 Visualization       195
##  8 Critical_analysis   145
##  9 Problem_solving      55
## 10 Confidence           52

ggplot(softskills_engineer,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="deeppink") + xlab('') + ylab('Frequency') + labs(title='Valued Soft Skills - Data Engineering') + coord_flip()

Data Analyst Jobs

freq_hardskills_analyst<- required_skills %>%
    filter(new_title == 'Data Analyst') %>% 
    mutate(R = grepl("r|ggplot", skills, ignore.case=TRUE)) %>%
    mutate(python = grepl("Python|pandas|numpy|matplotlib", skills, ignore.case=TRUE)) %>%
    mutate(SQL = grepl("SQL|postgresql|mysql|nosql", skills, ignore.case=TRUE)) %>%
    mutate(Microsoft_Suite = grepl("excel|word|power point|spreadsheet|outlook|sharepoint|vba", skills, ignore.case=TRUE)) %>%
    mutate(Azure = grepl("azur", skills, ignore.case=TRUE)) %>%
    mutate(Power_bi = grepl("power bi|power_bi", skills, ignore.case=TRUE)) %>%
    mutate(Tableau = grepl("tableau", skills, ignore.case=TRUE)) %>%
    mutate(Matlab = grepl("matlab", skills, ignore.case=TRUE)) %>%
    mutate(Java = grepl("java|javascript", skills, ignore.case=TRUE)) %>%
    mutate(gcp = grepl("gcp", skills, ignore.case=TRUE)) %>%
    mutate(cplusplus = grepl("c", skills, ignore.case=TRUE)) %>%
    mutate(Jupyter = grepl("jupyter", skills, ignore.case=TRUE)) %>%
    mutate(Visio = grepl("visio", skills, ignore.case=TRUE)) %>%
    mutate(Git = grepl("git", skills, ignore.case=TRUE)) %>%
    mutate(Machine_learning = grepl("machine learning|hadoop", description, ignore.case=TRUE))%>%
    mutate(Cloud = grepl("aws|snowflake", description, ignore.case=TRUE))%>%
    mutate(Other_skills = grepl("spss|sas|ssis|redshift|crystal|looker|airflow|spark|go|linus|unix|rust|alteryx|microstrategy", skills, ignore.case=TRUE)) %>%
  select(new_title, R, python, SQL, Microsoft_Suite, Azure, Power_bi, Tableau, Matlab, Java, gcp, cplusplus,Jupyter,Visio, Git, Machine_learning, Cloud, Other_skills)

freq_softskills_analyst<- required_skills %>%
    filter(new_title == 'Data Analyst') %>% 
    mutate(Communication = grepl("communicat", description, ignore.case=TRUE)) %>%
    mutate(Collaborative = grepl("collaborat|group|team", description, ignore.case=TRUE)) %>%
    mutate(TimeManagement = grepl("manage", description, ignore.case=TRUE)) %>%
    mutate(Leadership = grepl("leader", description, ignore.case=TRUE)) %>%
    mutate(Remote = grepl("remote", description, ignore.case=TRUE)) %>%
    mutate(Experienced = grepl("experience", description, ignore.case=TRUE)) %>%
    mutate(Visualization = grepl("visualization", description, ignore.case=TRUE)) %>%
    mutate(Problem_solving = grepl("problem solving", description, ignore.case=TRUE)) %>%      
    mutate(Critical_analysis = grepl("critical", description, ignore.case=TRUE)) %>%
    mutate(Confidence = grepl("confiden", description, ignore.case=TRUE)) %>%
  select(new_title, Communication, Collaborative, TimeManagement, Leadership, Remote, Experienced, Visualization, Problem_solving, Critical_analysis, Confidence)

Data Analyst Jobs Analysis:

hardskills_analyst <- freq_hardskills_analyst %>%
  select(-1)%>%
  summarise_all(sum)%>%
  gather(skill, freq)%>%
  arrange(desc(freq))

tibble (hardskills_analyst)

## # A tibble: 17 × 2
##    skill             freq
##    <chr>            <int>
##  1 R                  431
##  2 cplusplus          404
##  3 Microsoft_Suite    389
##  4 SQL                387
##  5 Other_skills       209
##  6 Tableau            205
##  7 Power_bi           186
##  8 python             165
##  9 Cloud              100
## 10 Machine_learning    28
## 11 Java                21
## 12 Azure               15
## 13 Visio               11
## 14 Jupyter              8
## 15 Git                  7
## 16 Matlab               4
## 17 gcp                  3

ggplot(hardskills_analyst,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="goldenrod") + xlab('') + ylab('Frequency') + labs(title='Valued Hard Skills - Data Analyst') + coord_flip()

softskills_analyst <- freq_softskills_analyst %>%
  select(-1)%>%
  summarise_all(sum)%>%
  gather(skill, freq)%>%
  arrange(desc(freq))

tibble (softskills_analyst)

## # A tibble: 10 × 2
##    skill              freq
##    <chr>             <int>
##  1 Experienced         743
##  2 Collaborative       622
##  3 TimeManagement      573
##  4 Communication       544
##  5 Remote              332
##  6 Leadership          273
##  7 Visualization       194
##  8 Critical_analysis   189
##  9 Confidence           71
## 10 Problem_solving      44

ggplot(softskills_analyst,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="brown") + xlab('') + ylab('Frequency') + labs(title='Valued Soft Skills - Data Analyst') + coord_flip()

Conclusion:

Based on our analysis, we concluded that the top 5 valued hard skills for Data Jobs are:

R
SQL
C++
Microsoft Suite
Python

Similarly, the top 5 valued soft skills are:

Experience
Collaboration
Communication
Time management
Leadership

# Create and display the first plot
plot1 <- ggplot(hardskills_science,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="red") + xlab('') + ylab('Frequency') + labs(title='Data Scientist') + coord_flip() + theme(plot.title = element_text(size = 10)) + theme(axis.text.x=element_text(angle=45, hjust=1))

# Create and display the second plot
plot2 <- ggplot(hardskills_engineer,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="chartreuse1") + xlab('') + ylab('Frequency') + labs(title='Data Engineer') + coord_flip() + theme(plot.title = element_text(size = 10)) + theme(axis.text.x=element_text(angle=45, hjust=1))

# Create and display the third plot
plot3<- ggplot(hardskills_analyst,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="purple") + xlab('') + ylab('Frequency') + labs(title='Data Analyst') + coord_flip() + theme(plot.title = element_text(size = 10)) + theme(axis.text.x=element_text(angle=45, hjust=1))


combined_plots <- grid.arrange(plot1, plot2, plot3, ncol = 3, top = "Desired Hard skills")

The plots above indicate that the most 5 most valued hard skills for the following jobs are:

Data science: R, SQL, C++, Python, Microsoft Suite
Data engineer: SQL, R, Python, C++, Tableau
Data analyst: R, C++, Microsoft Suite, SQL, Tableau

# Create and display the first plot
plot1 <- ggplot(softskills_science,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="blue") + xlab('') + ylab('Frequency') + labs(title='Data Scientist') + coord_flip() +theme(plot.title = element_text(size = 10)) + theme(axis.text.x=element_text(angle=45, hjust=1)) + theme(plot.title = element_text(size = 10))  

# Create and display the second plot
plot2 <- ggplot(softskills_engineer,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="yellow") + xlab('') + ylab('Frequency') + labs(title='Data Engineer') + coord_flip() + theme(plot.title = element_text(size = 10)) + theme(axis.text.x=element_text(angle=45, hjust=1)) + theme(plot.title = element_text(size = 10))  

# Create and display the third plot
plot3<- ggplot(softskills_analyst,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="brown") + xlab('') + ylab('Frequency') + labs(title='Data Analyst') + coord_flip() + theme(axis.text.x=element_text(angle=45, hjust=1)) + theme(plot.title = element_text(size = 10)) 

combined_plots <- grid.arrange(plot1, plot2, plot3, ncol = 3, top = "Desired Soft skills")

The plots above indicate that the most 5 most valued soft skills for the following jobs are:

Data science: experience, collaboration, communication, visualization , leadership.
Data engineer: experience, collaboration, communication, time management , remote.
Data analyst: experience, collaboration, time management, communication , remote.

Project 3 - The Most Valued Data Science Skills

Kossi Akplaka, Carol Campbell, Saloua Daouki, Souleymane Doumbia

2023-10-29