The goal of this project is to find and use data to answer the question “Which are the most valued data science. Our group members are: Carol Campbell, Kossi Akplaka, Souleymane Doumbia and Saloua Daouki.
We began by searching various websites for suitable data before finally settling on a Kaggle dataset by Luke Barousse, entitled Data Analyst Skill Analysis. This file was then uploaded into our Github Project 3 repository as “Data_Science_job.csv.”
url3 <- 'https://raw.githubusercontent.com/Group-Project-Data-607/Project3/main/Data_Science_job.csv'
df <- read.csv(file = url3)
tibble (df)
## # A tibble: 1,820 × 43
## X index title company_name location Job.posting.location description
## <int> <int> <chr> <chr> <chr> <chr> <chr>
## 1 0 0 Enterpris… Children’s … Kansas … via AARP Job Board "Thanks fo…
## 2 1 1 Senior Co… Harnham United … via WJHL Jobs "Senior Co…
## 3 2 2 Business … Lockton Kansas … via Lockton - Talen… "We are lo…
## 4 3 3 PowerBI D… Mitchell Ma… United … via Mitchell Martin… "Our clien…
## 5 4 4 Sr SaaS B… Cyber Resou… United … via Central Illinoi… "Job Title…
## 6 5 5 Sr. Syste… General Dyn… Anywhere via Clearance Jobs "REQ#: RQ1…
## 7 6 6 Sr. Data … CRB Kansas … via Smart Recruiter… "Company D…
## 8 7 7 Senior An… Ada United … via Simplify Jobs "Ada was f…
## 9 8 8 Data De-I… PwC United … via Mendeley "• *Specia…
## 10 9 9 Data Anal… Adtalem Glo… Anywhere via Adtalem Global … "Opportuni…
## # ℹ 1,810 more rows
## # ℹ 36 more variables: extensions <chr>, job_id <chr>, thumbnail <chr>,
## # posted_at <chr>, schedule_type <chr>, work_from_home <lgl>, salary <chr>,
## # search_term <chr>, date_time <chr>, search_location <chr>,
## # commute_time <lgl>, salary_pay <chr>, salary_rate <chr>, salary_avg <dbl>,
## # salary_min <dbl>, salary_max <dbl>, salary_hourly <dbl>,
## # salary_yearly <dbl>, salary_standardized <dbl>, Required.skills <chr>, …
This dataset contained many elements that we deemed necessary for our analysis - a variety of job listings from various companies from all over country, each listing desired skills for the ideal candidate. While ideal from a data entry standpoint, a major challenge that we encountered was that the data was far from tidy. There were multiple skills listed in one variable, vague job titles that did not match the job description, and way too many variables for the task at hand. Thus let the tidying begin.
The data included job descriptions that were vague and made it difficult to ascertain the true nature of the job posting. Therefore, we took the liberty of creating job titles based on their respective descriptions, and decided to group the jobs as “Data Scientist”, “Data Engineer”, “Data Analyst”.
For this following job title, I’ll look for this following keywords in the description column:
data science: Machine learning, modeling, A/B testing
data engineer: ETL, pipelines, warehouse, architect
data analyst: All other
If a job description contains both “data analyst” and “data science” keywords, the code will categorize it based on the first condition that matches.
skills <- df %>%
mutate(new_title = case_when(
grepl("ETL|pipeline|warehouse|architect", description, ignore.case = TRUE) ~ "Data Engineer",
grepl("machine learning | modeling|A/B testing", description, ignore.case = TRUE) ~ "Data Science",
TRUE ~ "Data Analyst"
))
tibble (skills)
## # A tibble: 1,820 × 44
## X index title company_name location Job.posting.location description
## <int> <int> <chr> <chr> <chr> <chr> <chr>
## 1 0 0 Enterpris… Children’s … Kansas … via AARP Job Board "Thanks fo…
## 2 1 1 Senior Co… Harnham United … via WJHL Jobs "Senior Co…
## 3 2 2 Business … Lockton Kansas … via Lockton - Talen… "We are lo…
## 4 3 3 PowerBI D… Mitchell Ma… United … via Mitchell Martin… "Our clien…
## 5 4 4 Sr SaaS B… Cyber Resou… United … via Central Illinoi… "Job Title…
## 6 5 5 Sr. Syste… General Dyn… Anywhere via Clearance Jobs "REQ#: RQ1…
## 7 6 6 Sr. Data … CRB Kansas … via Smart Recruiter… "Company D…
## 8 7 7 Senior An… Ada United … via Simplify Jobs "Ada was f…
## 9 8 8 Data De-I… PwC United … via Mendeley "• *Specia…
## 10 9 9 Data Anal… Adtalem Glo… Anywhere via Adtalem Global … "Opportuni…
## # ℹ 1,810 more rows
## # ℹ 37 more variables: extensions <chr>, job_id <chr>, thumbnail <chr>,
## # posted_at <chr>, schedule_type <chr>, work_from_home <lgl>, salary <chr>,
## # search_term <chr>, date_time <chr>, search_location <chr>,
## # commute_time <lgl>, salary_pay <chr>, salary_rate <chr>, salary_avg <dbl>,
## # salary_min <dbl>, salary_max <dbl>, salary_hourly <dbl>,
## # salary_yearly <dbl>, salary_standardized <dbl>, Required.skills <chr>, …
Here we parsed out only the columns necessary for our analysis, reducing 43 columns/variables to 8 columns/variables.
required_skills <- subset(skills, select = c('new_title', 'title', 'company_name', 'location', 'Job.posting.location', 'description', 'schedule_type', 'Required.skills'))
#rename required_skills column to skills
colnames(required_skills)[8] ="skills"
tibble (required_skills)
## # A tibble: 1,820 × 8
## new_title title company_name location Job.posting.location description
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Data Analyst Enterpr… Children’s … Kansas … via AARP Job Board "Thanks fo…
## 2 Data Science Senior … Harnham United … via WJHL Jobs "Senior Co…
## 3 Data Science Busines… Lockton Kansas … via Lockton - Talen… "We are lo…
## 4 Data Analyst PowerBI… Mitchell Ma… United … via Mitchell Martin… "Our clien…
## 5 Data Engineer Sr SaaS… Cyber Resou… United … via Central Illinoi… "Job Title…
## 6 Data Engineer Sr. Sys… General Dyn… Anywhere via Clearance Jobs "REQ#: RQ1…
## 7 Data Engineer Sr. Dat… CRB Kansas … via Smart Recruiter… "Company D…
## 8 Data Engineer Senior … Ada United … via Simplify Jobs "Ada was f…
## 9 Data Engineer Data De… PwC United … via Mendeley "• *Specia…
## 10 Data Analyst Data An… Adtalem Glo… Anywhere via Adtalem Global … "Opportuni…
## # ℹ 1,810 more rows
## # ℹ 2 more variables: schedule_type <chr>, skills <chr>
Remove the special characters in from the ‘skills’ column for downstream analysis
#this code taken from: https://rdrr.io/github/YangWu1227/citizenr/src/R/clearning-helpers.R
#function to remove special characters across dataframe
rm_spl_char <- function(df, var) {
if (is.data.frame(df) == FALSE) {
stop("'df' must be a data frame", call. = FALSE)
}
if (!is_character(var) | vec_size(var) > length(df)) {
stop("'var' must be a character vector no greater than length(df)", call. = FALSE)
}
if (!all(var %in% names(df))) {
stop("'var' must be columns found in 'df'", call. = FALSE)
}
}
# remove special characters from skills column
required_skills[8] <- required_skills[8] %>%
mutate(
across(
.cols = everything(),
.fns = ~ str_replace_all(.x, "[^[:alnum:]^.]", " ")
)
)
tibble(required_skills[8])
## # A tibble: 1,820 × 1
## skills
## <chr>
## 1 " "
## 2 " sql excel python "
## 3 " r sql azure python "
## 4 " power bi "
## 5 " sql jira "
## 6 " python tableau r sql excel "
## 7 " power bi ssis python tableau dax powershell r git …
## 8 " go spark c sql "
## 9 " python excel "
## 10 " spss tableau excel "
## # ℹ 1,810 more rows
We were able to determine that of the three types of jobs in this dataset, Data Analyst was the most sought after with 1012 postings, followed by Data Engineer with 481 postings, then Data Science with 327 postings.
## Count of the different title
table(required_skills$new_title)
##
## Data Analyst Data Engineer Data Science
## 1012 481 327
There are a number of technological and data visualization skills that employers consider essential for data scientist candidates. Also known as “hard kills”, these requirements include competence in the latest software and statistical analysis tools, strong mathematical, quantitative and analytical skills. Here, our next task was filter and group by like software skills in preparation for analysis.
freq_hardskills<- required_skills %>%
mutate(R = grepl("r|ggplot", skills, ignore.case=TRUE)) %>%
mutate(python = grepl("Python|pandas|numpy|matplotlib", skills, ignore.case=TRUE)) %>%
mutate(SQL = grepl("SQL|postgresql|mysql|nosql", skills, ignore.case=TRUE)) %>%
mutate(Microsoft_Suite = grepl("excel|word|power point|spreadsheet", skills, ignore.case=TRUE)) %>%
mutate(Azure = grepl("azur", skills, ignore.case=TRUE)) %>%
mutate(Power_bi = grepl("power bi|power_bi", skills, ignore.case=TRUE)) %>%
mutate(Tableau = grepl("tableau", skills, ignore.case=TRUE)) %>%
mutate(Matlab = grepl("matlab", skills, ignore.case=TRUE)) %>%
mutate(Java = grepl("java|javascript", skills, ignore.case=TRUE)) %>%
mutate(gcp = grepl("gcp", skills, ignore.case=TRUE)) %>%
mutate(cplusplus = grepl("c", skills, ignore.case=TRUE)) %>%
mutate(Jupyter = grepl("jupyter", skills, ignore.case=TRUE)) %>%
mutate(Visio = grepl("visio", skills, ignore.case=TRUE)) %>%
mutate(Git = grepl("git", skills, ignore.case=TRUE)) %>%
mutate(Machine_learning = grepl("machine learning|hadoop", description, ignore.case=TRUE))%>%
mutate(Cloud = grepl("aws|snowflake", description, ignore.case=TRUE))%>%
mutate(Other_skills = grepl("spss|sas|ssis|redshift|crystal|looker|airflow|spark|go|linus|unix|rust|alteryx|microstrategy", skills, ignore.case=TRUE)) %>%
select(new_title, R, python, SQL, Microsoft_Suite, Azure, Power_bi, Tableau, Matlab, Java, gcp, cplusplus,Jupyter,Visio, Git, Machine_learning, Cloud, Other_skills)
While “hard skills” are important to prospective employers, strong “soft skills” are also necessary for optimal career performance. Also known as “transferable skills”, soft skills include the ability to effectively communicate across and within their respective teams, with vendors and other stakeholders. Leadership, time management and problem solving are other soft skills that are often sought after. Lastly, one must also be able to present findings in non-technical terms, easily understood by a universal audience.
freq_softskills<- required_skills %>%
mutate(Communication = grepl("communicat", description, ignore.case=TRUE)) %>%
mutate(Collaborative = grepl("collaborat|group|team", description, ignore.case=TRUE)) %>%
mutate(TimeManagement = grepl("manage", description, ignore.case=TRUE)) %>%
mutate(Leadership = grepl("leader", description, ignore.case=TRUE)) %>%
mutate(Remote = grepl("remote", description, ignore.case=TRUE)) %>%
mutate(Experienced = grepl("experience", description, ignore.case=TRUE)) %>%
mutate(Visualization = grepl("visualization", description, ignore.case=TRUE)) %>%
mutate(Problem_solving = grepl("problem solving", description, ignore.case=TRUE)) %>%
mutate(Critical_analysis = grepl("critical", description, ignore.case=TRUE)) %>%
mutate(Confidence = grepl("confiden", description, ignore.case=TRUE)) %>%
select(new_title, Communication, Collaborative, TimeManagement, Leadership, Remote, Experienced, Visualization, Problem_solving, Critical_analysis, Confidence)
Frequency analysis of hard skills
hardskills <- freq_hardskills %>%
select(-1)%>%
summarise_all(sum)%>%
gather(skill, freq)%>%
arrange(desc(freq))
tibble(hardskills)
## # A tibble: 17 × 2
## skill freq
## <chr> <int>
## 1 R 1021
## 2 SQL 1014
## 3 cplusplus 831
## 4 Microsoft_Suite 746
## 5 python 594
## 6 Other_skills 584
## 7 Tableau 528
## 8 Power_bi 510
## 9 Cloud 340
## 10 Machine_learning 202
## 11 Azure 108
## 12 Java 78
## 13 Git 29
## 14 Visio 28
## 15 gcp 15
## 16 Jupyter 12
## 17 Matlab 9
Frequency analysis for other skills
softskills <- freq_softskills %>%
select(-1)%>%
summarise_all(sum)%>%
gather(skill, freq)%>%
arrange(desc(freq))
tibble(softskills)
## # A tibble: 10 × 2
## skill freq
## <chr> <int>
## 1 Experienced 1533
## 2 Collaborative 1339
## 3 Communication 1151
## 4 TimeManagement 1047
## 5 Leadership 695
## 6 Remote 656
## 7 Visualization 601
## 8 Critical_analysis 502
## 9 Problem_solving 181
## 10 Confidence 139
ggplot(hardskills,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="blue") + xlab('') + ylab('Frequency') + labs(title='Valued Hard Skills - Data Jobs') + coord_flip()
ggplot(softskills,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="orange") + xlab('') + ylab('Frequency') + labs(title='Valued Soft Skills - Data Jobs') + coord_flip()
Here we look at the hard skills required for Data Science jobs
freq_hardskills_science<- required_skills %>%
filter(new_title == 'Data Science') %>%
mutate(R = grepl("r|ggplot", skills, ignore.case=TRUE)) %>%
mutate(python = grepl("Python|pandas|numpy|matplotlib", skills, ignore.case=TRUE)) %>%
mutate(SQL = grepl("SQL|postgresql|mysql|nosql", skills, ignore.case=TRUE)) %>%
mutate(Microsoft_Suite = grepl("excel|word|power point|spreadsheet|outlook|sharepoint|vba", skills, ignore.case=TRUE)) %>%
mutate(Azure = grepl("azur", skills, ignore.case=TRUE)) %>%
mutate(Power_bi = grepl("power bi|power_bi", skills, ignore.case=TRUE)) %>%
mutate(Tableau = grepl("tableau", skills, ignore.case=TRUE)) %>%
mutate(Matlab = grepl("matlab", skills, ignore.case=TRUE)) %>%
mutate(Java = grepl("java|javascript", skills, ignore.case=TRUE)) %>%
mutate(gcp = grepl("gcp", skills, ignore.case=TRUE)) %>%
mutate(cplusplus = grepl("c", skills, ignore.case=TRUE)) %>%
mutate(Jupyter = grepl("jupyter", skills, ignore.case=TRUE)) %>%
mutate(Visio = grepl("visio", skills, ignore.case=TRUE)) %>%
mutate(Git = grepl("git", skills, ignore.case=TRUE)) %>%
mutate(Machine_learning = grepl("machine learning|hadoop", description, ignore.case=TRUE))%>%
mutate(Cloud = grepl("aws|snowflake", description, ignore.case=TRUE))%>%
mutate(Other_skills = grepl("spss|sas|ssis|redshift|crystal|looker|airflow|spark|go|linus|unix|rust|alteryx|microstrategy", skills, ignore.case=TRUE)) %>%
select(new_title, R, python, SQL, Microsoft_Suite, Azure, Power_bi, Tableau, Matlab, Java, gcp, cplusplus,Jupyter,Visio, Git, Machine_learning, Cloud, Other_skills)
Desired soft skills for Data Science jobs
freq_softskills_science<- required_skills %>%
filter(new_title == 'Data Science') %>%
mutate(Communication = grepl("communicat", description, ignore.case=TRUE)) %>%
mutate(Collaborative = grepl("collaborat|group|team", description, ignore.case=TRUE)) %>%
mutate(TimeManagement = grepl("manage", description, ignore.case=TRUE)) %>%
mutate(Leadership = grepl("leader", description, ignore.case=TRUE)) %>%
mutate(Remote = grepl("remote", description, ignore.case=TRUE)) %>%
mutate(Experienced = grepl("experience", description, ignore.case=TRUE)) %>%
mutate(Visualization = grepl("visualization", description, ignore.case=TRUE)) %>%
mutate(Problem_solving = grepl("problem solving", description, ignore.case=TRUE)) %>%
mutate(Critical_analysis = grepl("critical", description, ignore.case=TRUE)) %>%
mutate(Confidence = grepl("confiden", description, ignore.case=TRUE)) %>%
select(new_title, Communication, Collaborative, TimeManagement, Leadership, Remote, Experienced, Visualization, Problem_solving, Critical_analysis, Confidence)
hardskills_science <- freq_hardskills_science %>%
select(-1)%>%
summarise_all(sum)%>%
gather(skill, freq)%>%
arrange(desc(freq))
tibble(hardskills_science)
## # A tibble: 17 × 2
## skill freq
## <chr> <int>
## 1 R 247
## 2 SQL 242
## 3 cplusplus 205
## 4 python 201
## 5 Other_skills 192
## 6 Microsoft_Suite 188
## 7 Power_bi 177
## 8 Tableau 140
## 9 Machine_learning 109
## 10 Cloud 80
## 11 Azure 8
## 12 Java 6
## 13 Visio 5
## 14 Git 5
## 15 Matlab 2
## 16 gcp 0
## 17 Jupyter 0
ggplot(hardskills_science,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="red") + xlab('') + ylab('Frequency') + labs(title='Valued Hard Skills - Data Science') + coord_flip()
softskills_science <- freq_softskills_science %>%
select(-1)%>%
summarise_all(sum)%>%
gather(skill, freq)%>%
arrange(desc(freq))
tibble(softskills_science)
## # A tibble: 10 × 2
## skill freq
## <chr> <int>
## 1 Experienced 318
## 2 Collaborative 298
## 3 Communication 252
## 4 Visualization 212
## 5 Leadership 210
## 6 Critical_analysis 168
## 7 TimeManagement 153
## 8 Remote 85
## 9 Problem_solving 82
## 10 Confidence 16
ggplot(softskills_science,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="lightblue") + xlab('') + ylab('Frequency') + labs(title='Valued Soft Skills - Data Science ') + coord_flip()
freq_hardskills_engineer<- required_skills %>%
filter(new_title == 'Data Engineer') %>%
mutate(R = grepl("r|ggplot", skills, ignore.case=TRUE)) %>%
mutate(python = grepl("Python|pandas|numpy|matplotlib", skills, ignore.case=TRUE)) %>%
mutate(SQL = grepl("SQL|postgresql|mysql|nosql", skills, ignore.case=TRUE)) %>%
mutate(Microsoft_Suite = grepl("excel|word|power point|spreadsheet|outlook|sharepoint|vba", skills, ignore.case=TRUE)) %>%
mutate(Azure = grepl("azur", skills, ignore.case=TRUE)) %>%
mutate(Power_bi = grepl("power bi|power_bi", skills, ignore.case=TRUE)) %>%
mutate(Tableau = grepl("tableau", skills, ignore.case=TRUE)) %>%
mutate(Matlab = grepl("matlab", skills, ignore.case=TRUE)) %>%
mutate(Java = grepl("java|javascript", skills, ignore.case=TRUE)) %>%
mutate(gcp = grepl("gcp", skills, ignore.case=TRUE)) %>%
mutate(cplusplus = grepl("c", skills, ignore.case=TRUE)) %>%
mutate(Jupyter = grepl("jupyter", skills, ignore.case=TRUE)) %>%
mutate(Visio = grepl("visio", skills, ignore.case=TRUE)) %>%
mutate(Git = grepl("git", skills, ignore.case=TRUE)) %>%
mutate(Machine_learning = grepl("machine learning|hadoop", description, ignore.case=TRUE))%>%
mutate(Cloud = grepl("aws|snowflake", description, ignore.case=TRUE))%>%
mutate(Other_skills = grepl("spss|sas|ssis|redshift|crystal|looker|airflow|spark|go|linus|unix|rust|alteryx|microstrategy", skills, ignore.case=TRUE)) %>%
select(new_title, R, python, SQL, Microsoft_Suite, Azure, Power_bi, Tableau, Matlab, Java, gcp, cplusplus,Jupyter,Visio, Git, Machine_learning, Cloud, Other_skills)
freq_softskills_engineer<- required_skills %>%
filter(new_title == 'Data Engineer') %>%
mutate(Communication = grepl("communicat", description, ignore.case=TRUE)) %>%
mutate(Collaborative = grepl("collaborat|group|team", description, ignore.case=TRUE)) %>%
mutate(TimeManagement = grepl("manage", description, ignore.case=TRUE)) %>%
mutate(Leadership = grepl("leader", description, ignore.case=TRUE)) %>%
mutate(Remote = grepl("remote", description, ignore.case=TRUE)) %>%
mutate(Experienced = grepl("experience", description, ignore.case=TRUE)) %>%
mutate(Visualization = grepl("visualization", description, ignore.case=TRUE)) %>%
mutate(Problem_solving = grepl("problem solving", description, ignore.case=TRUE)) %>%
mutate(Critical_analysis = grepl("critical", description, ignore.case=TRUE)) %>%
mutate(Confidence = grepl("confiden", description, ignore.case=TRUE)) %>%
select(new_title, Communication, Collaborative, TimeManagement, Leadership, Remote, Experienced, Visualization, Problem_solving, Critical_analysis, Confidence)
hardskills_engineer <- freq_hardskills_engineer %>%
select(-1)%>%
summarise_all(sum)%>%
gather(skill, freq)%>%
arrange(desc(freq))
tibble (hardskills_engineer)
## # A tibble: 17 × 2
## skill freq
## <chr> <int>
## 1 SQL 385
## 2 R 343
## 3 python 228
## 4 cplusplus 222
## 5 Tableau 183
## 6 Other_skills 183
## 7 Microsoft_Suite 178
## 8 Cloud 160
## 9 Power_bi 147
## 10 Azure 85
## 11 Machine_learning 65
## 12 Java 51
## 13 Git 17
## 14 gcp 12
## 15 Visio 12
## 16 Jupyter 4
## 17 Matlab 3
ggplot(hardskills_engineer,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="chartreuse1") + xlab('') + ylab('Frequency') + labs(title='Valued Hard Skills - Data Engineering') + coord_flip()
softskills_engineer <- freq_softskills_engineer %>%
select(-1)%>%
summarise_all(sum)%>%
gather(skill, freq)%>%
arrange(desc(freq))
tibble(softskills_engineer)
## # A tibble: 10 × 2
## skill freq
## <chr> <int>
## 1 Experienced 472
## 2 Collaborative 419
## 3 Communication 355
## 4 TimeManagement 321
## 5 Remote 239
## 6 Leadership 212
## 7 Visualization 195
## 8 Critical_analysis 145
## 9 Problem_solving 55
## 10 Confidence 52
ggplot(softskills_engineer,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="deeppink") + xlab('') + ylab('Frequency') + labs(title='Valued Soft Skills - Data Engineering') + coord_flip()
freq_hardskills_analyst<- required_skills %>%
filter(new_title == 'Data Analyst') %>%
mutate(R = grepl("r|ggplot", skills, ignore.case=TRUE)) %>%
mutate(python = grepl("Python|pandas|numpy|matplotlib", skills, ignore.case=TRUE)) %>%
mutate(SQL = grepl("SQL|postgresql|mysql|nosql", skills, ignore.case=TRUE)) %>%
mutate(Microsoft_Suite = grepl("excel|word|power point|spreadsheet|outlook|sharepoint|vba", skills, ignore.case=TRUE)) %>%
mutate(Azure = grepl("azur", skills, ignore.case=TRUE)) %>%
mutate(Power_bi = grepl("power bi|power_bi", skills, ignore.case=TRUE)) %>%
mutate(Tableau = grepl("tableau", skills, ignore.case=TRUE)) %>%
mutate(Matlab = grepl("matlab", skills, ignore.case=TRUE)) %>%
mutate(Java = grepl("java|javascript", skills, ignore.case=TRUE)) %>%
mutate(gcp = grepl("gcp", skills, ignore.case=TRUE)) %>%
mutate(cplusplus = grepl("c", skills, ignore.case=TRUE)) %>%
mutate(Jupyter = grepl("jupyter", skills, ignore.case=TRUE)) %>%
mutate(Visio = grepl("visio", skills, ignore.case=TRUE)) %>%
mutate(Git = grepl("git", skills, ignore.case=TRUE)) %>%
mutate(Machine_learning = grepl("machine learning|hadoop", description, ignore.case=TRUE))%>%
mutate(Cloud = grepl("aws|snowflake", description, ignore.case=TRUE))%>%
mutate(Other_skills = grepl("spss|sas|ssis|redshift|crystal|looker|airflow|spark|go|linus|unix|rust|alteryx|microstrategy", skills, ignore.case=TRUE)) %>%
select(new_title, R, python, SQL, Microsoft_Suite, Azure, Power_bi, Tableau, Matlab, Java, gcp, cplusplus,Jupyter,Visio, Git, Machine_learning, Cloud, Other_skills)
freq_softskills_analyst<- required_skills %>%
filter(new_title == 'Data Analyst') %>%
mutate(Communication = grepl("communicat", description, ignore.case=TRUE)) %>%
mutate(Collaborative = grepl("collaborat|group|team", description, ignore.case=TRUE)) %>%
mutate(TimeManagement = grepl("manage", description, ignore.case=TRUE)) %>%
mutate(Leadership = grepl("leader", description, ignore.case=TRUE)) %>%
mutate(Remote = grepl("remote", description, ignore.case=TRUE)) %>%
mutate(Experienced = grepl("experience", description, ignore.case=TRUE)) %>%
mutate(Visualization = grepl("visualization", description, ignore.case=TRUE)) %>%
mutate(Problem_solving = grepl("problem solving", description, ignore.case=TRUE)) %>%
mutate(Critical_analysis = grepl("critical", description, ignore.case=TRUE)) %>%
mutate(Confidence = grepl("confiden", description, ignore.case=TRUE)) %>%
select(new_title, Communication, Collaborative, TimeManagement, Leadership, Remote, Experienced, Visualization, Problem_solving, Critical_analysis, Confidence)
hardskills_analyst <- freq_hardskills_analyst %>%
select(-1)%>%
summarise_all(sum)%>%
gather(skill, freq)%>%
arrange(desc(freq))
tibble (hardskills_analyst)
## # A tibble: 17 × 2
## skill freq
## <chr> <int>
## 1 R 431
## 2 cplusplus 404
## 3 Microsoft_Suite 389
## 4 SQL 387
## 5 Other_skills 209
## 6 Tableau 205
## 7 Power_bi 186
## 8 python 165
## 9 Cloud 100
## 10 Machine_learning 28
## 11 Java 21
## 12 Azure 15
## 13 Visio 11
## 14 Jupyter 8
## 15 Git 7
## 16 Matlab 4
## 17 gcp 3
ggplot(hardskills_analyst,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="goldenrod") + xlab('') + ylab('Frequency') + labs(title='Valued Hard Skills - Data Analyst') + coord_flip()
softskills_analyst <- freq_softskills_analyst %>%
select(-1)%>%
summarise_all(sum)%>%
gather(skill, freq)%>%
arrange(desc(freq))
tibble (softskills_analyst)
## # A tibble: 10 × 2
## skill freq
## <chr> <int>
## 1 Experienced 743
## 2 Collaborative 622
## 3 TimeManagement 573
## 4 Communication 544
## 5 Remote 332
## 6 Leadership 273
## 7 Visualization 194
## 8 Critical_analysis 189
## 9 Confidence 71
## 10 Problem_solving 44
ggplot(softskills_analyst,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="brown") + xlab('') + ylab('Frequency') + labs(title='Valued Soft Skills - Data Analyst') + coord_flip()
Based on our analysis, we concluded that the top 5 valued hard skills for Data Jobs are:
R
SQL
C++
Microsoft Suite
Python
Similarly, the top 5 valued soft skills are:
Experience
Collaboration
Communication
Time management
Leadership
# Create and display the first plot
plot1 <- ggplot(hardskills_science,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="red") + xlab('') + ylab('Frequency') + labs(title='Data Scientist') + coord_flip() + theme(plot.title = element_text(size = 10)) + theme(axis.text.x=element_text(angle=45, hjust=1))
# Create and display the second plot
plot2 <- ggplot(hardskills_engineer,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="chartreuse1") + xlab('') + ylab('Frequency') + labs(title='Data Engineer') + coord_flip() + theme(plot.title = element_text(size = 10)) + theme(axis.text.x=element_text(angle=45, hjust=1))
# Create and display the third plot
plot3<- ggplot(hardskills_analyst,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="purple") + xlab('') + ylab('Frequency') + labs(title='Data Analyst') + coord_flip() + theme(plot.title = element_text(size = 10)) + theme(axis.text.x=element_text(angle=45, hjust=1))
combined_plots <- grid.arrange(plot1, plot2, plot3, ncol = 3, top = "Desired Hard skills")
The plots above indicate that the most 5 most valued hard skills for the following jobs are:
Data science: R, SQL, C++, Python, Microsoft Suite
Data engineer: SQL, R, Python, C++, Tableau
Data analyst: R, C++, Microsoft Suite, SQL, Tableau
# Create and display the first plot
plot1 <- ggplot(softskills_science,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="blue") + xlab('') + ylab('Frequency') + labs(title='Data Scientist') + coord_flip() +theme(plot.title = element_text(size = 10)) + theme(axis.text.x=element_text(angle=45, hjust=1)) + theme(plot.title = element_text(size = 10))
# Create and display the second plot
plot2 <- ggplot(softskills_engineer,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="yellow") + xlab('') + ylab('Frequency') + labs(title='Data Engineer') + coord_flip() + theme(plot.title = element_text(size = 10)) + theme(axis.text.x=element_text(angle=45, hjust=1)) + theme(plot.title = element_text(size = 10))
# Create and display the third plot
plot3<- ggplot(softskills_analyst,aes(x=reorder(skill, freq), y=freq)) + geom_bar(stat='identity',fill="brown") + xlab('') + ylab('Frequency') + labs(title='Data Analyst') + coord_flip() + theme(axis.text.x=element_text(angle=45, hjust=1)) + theme(plot.title = element_text(size = 10))
combined_plots <- grid.arrange(plot1, plot2, plot3, ncol = 3, top = "Desired Soft skills")
The plots above indicate that the most 5 most valued soft skills for the following jobs are:
Data science: experience, collaboration, communication, visualization , leadership.
Data engineer: experience, collaboration, communication, time management , remote.
Data analyst: experience, collaboration, time management, communication , remote.