library(tidyr)
library(dplyr)
library(ggplot2)
# Clean job data for project 3 across industries
# as_tibble() replaces tbl_df(), which newer dplyr releases deprecate
JobData <- as_tibble(read.csv("Project3_CleanData.csv", stringsAsFactors = FALSE, check.names = FALSE))
Which skill sets are the most frequent in terms of keywords?
FinanceAndInsurData_SkillSetCount <- JobData %>%
  filter(Industry == "Finance") %>%
  count(`Skill Set`, sort = TRUE)
# count() is a short-hand for group_by() + tally()
(FinanceAndInsurData_SkillSetCount)
## # A tibble: 5 x 2
##   `Skill Set`               n
##   <chr>                 <int>
## 1 Programming/Technical    87
## 2 Soft Skill               17
## 3 Business                 11
## 4 Analysis/Research         5
## 5 Mathematics               3
As shown above, Programming/Technical keywords dominate (87), followed by Soft Skill (17), while Business keywords (11) are less frequent. This may imply that, for data scientist jobs in finance and insurance, business knowledge is not the major focus.
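To make the `count()` call used above concrete, here is a minimal sketch showing that it gives the same result as the longer `group_by()` + `tally()` form, and how a share column could be added. The `toy_jobs` data frame and the `share` column are hypothetical stand-ins for illustration; they are not part of the original analysis or of `Project3_CleanData.csv`.

```r
library(dplyr)

# Hypothetical toy data standing in for JobData (the real CSV is not reproduced here)
toy_jobs <- tibble::tibble(
  Industry    = c("Finance", "Finance", "Finance", "Finance", "Retail"),
  `Skill Set` = c("Programming/Technical", "Programming/Technical",
                  "Soft Skill", "Business", "Business")
)

finance <- toy_jobs %>% filter(Industry == "Finance")

# Short form, as used in this report
by_count <- finance %>% count(`Skill Set`, sort = TRUE)

# Long form: group the rows, then tally each group
by_tally <- finance %>% group_by(`Skill Set`) %>% tally(sort = TRUE)

# Optional (assumption, not in the original): each skill set's share of all rows
by_count_share <- by_count %>% mutate(share = n / sum(n))
```

A share column like this makes it easier to compare skill sets across industries that have different numbers of postings.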
So, which skill types are most frequent within the Programming/Technical and Soft Skill sets?
FinanceAndInsurData_Programming_tech <- JobData %>%
  filter(Industry == "Finance" & `Skill Set` == "Programming/Technical") %>%
  count(`Skill Type`, sort = TRUE)
(FinanceAndInsurData_Programming_tech)
## # A tibble: 54 x 2
##    `Skill Type`                      n
##    <chr>                         <int>
##  1 Python                            7
##  2 Java                              5
##  3 R                                 5
##  4 Hadoop                            4
##  5 Scala                             4
##  6 C++                               3
##  7 Spark                             3
##  8 SQL                               3
##  9 Build Machine Learning Models     2
## 10 Data Mining                       2
## # ... with 44 more rows
Python (7), Java (5), R (5), and Hadoop (4, tied with Scala) are the most frequently requested and can be treated as must-haves. Skills such as machine learning and cloud technology are important too, but they can be platform-specific (e.g. Google Machine Learning).
FinanceAndInsurData_SoftSkills <- JobData %>%
  filter(Industry == "Finance" & `Skill Set` == "Soft Skill") %>%
  count(`Skill Type`, sort = TRUE)
(FinanceAndInsurData_SoftSkills)
## # A tibble: 12 x 2
##    `Skill Type`            n
##    <chr>               <int>
##  1 Good Communication      5
##  2 Collaboration           2
##  3 Agile                   1
##  4 Attention To Detail     1
##  5 Curiosity               1
##  6 Deadline Management     1
##  7 Friendly                1
##  8 Good Work Ethic         1
##  9 Positive                1
## 10 Quantitative            1
## 11 Self-Starter            1
## 12 Work Independantly      1
Communication and collaboration are the most requested soft skills, while personal traits such as being friendly, positive, curious, and attentive to detail are also considered.
# Bar chart of soft-skill counts; columns are referenced by name inside aes()
# rather than with `$`, which ggplot2 discourages
ggplot(FinanceAndInsurData_SoftSkills, aes(x = `Skill Type`, y = n, fill = n)) +
  geom_bar(stat = "identity") +
  xlab("Skills") +
  ylab("Frequency") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 65, hjust = 1)) +
  ggtitle("Soft Skills of Data Scientists in Finance/Insurance")
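As an optional refinement, the bars can be ordered by frequency instead of alphabetically, which makes the most requested skills easier to spot. The sketch below is an illustration under assumptions, not the original plot: the `soft` data frame reuses the first three rows of the soft-skill counts printed above, and the column name `skill` is a hypothetical stand-in for `Skill Type`.

```r
library(ggplot2)

# First three rows of the soft-skill counts printed above;
# `skill` is a hypothetical stand-in for the `Skill Type` column
soft <- data.frame(
  skill = c("Agile", "Collaboration", "Good Communication"),
  n     = c(1L, 2L, 5L)
)

# reorder(skill, -n) sorts the x-axis categories by descending count
p <- ggplot(soft, aes(x = reorder(skill, -n), y = n, fill = n)) +
  geom_col() +  # same result as geom_bar(stat = "identity")
  xlab("Skills") +
  ylab("Frequency") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 65, hjust = 1))

# Category order that the x-axis will use
bar_order <- levels(reorder(soft$skill, -soft$n))
```

Printing `p` draws the chart; `bar_order` is only there to inspect the resulting category order.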
So, for the finance industry, hard skills such as Python, Java, R, and Hadoop are must-haves, and the ability to communicate and work with others is the most important soft skill.