This is a project for your entire class section to work on together, since being able to work effectively on a virtual team is a key “soft skill” for data scientists.
Please note especially the requirement about making a presentation during our first meetup after the project is due.
W. Edwards Deming said, “In God we trust, all others must bring data.”
Please use data to answer the question:
“Which are the most valued data science skills?”
Consider your work as an exploration: there is not necessarily a “right answer.”
A database, jobs, is created on each team member's local machine using MySQL. Using an Open Database Connectivity (ODBC) driver, a connection is established through the Data Source Name (DSN), linking R to the appropriate user with SELECT and INSERT privileges. All files are made available in a shared GitHub repository, DATA607_CUNY_2021_Project3.
## [1] "job_opening_tbl" "location_dim"
## [1] "uniq_id" "location" "city" "country" "state" "zip_code"
## [1] "uniq_id" "job_title" "job_descr"
The function OD_DL_csv is created to access the necessary CSV files, which contain over 30,000 observations. This method was selected over loading directly from GitHub because of storage limitations with files this large. The shared file URLs and CSV names are stored in the variables url_location, url_job_openings, csv_locations, and csv_job_openings, respectively.
OD_DL_csv splits the shared URL at the query string, uses unlist() to flatten the result, and pastes on a download parameter to form a direct-download link. From there, curl_download() from the curl package completes the download.
OD_DL_csv <- function(sharedURL, file_name, save2wd = FALSE){
  # Keep everything before the query string of the shared URL
  URL1 <- unlist(strsplit(sharedURL, "[?]"))[1]
  URL1 <- paste0(URL1, "?download=1") # edit the URL to make it a direct-download link
  # Download the file to a temp directory using the supplied file name
  curl::curl_download(
    URL1,
    destfile = file.path(tempdir(), file_name),
    mode = "wb"
  )
  # If the user wants it saved to their working directory, copy the file there
  if(isTRUE(save2wd)){
    file.copy(
      from = file.path(tempdir(), file_name),
      to = "./")
  }
  # return the CSV as a data.frame
  return(read.csv(file.path(tempdir(), file_name), stringsAsFactors = FALSE))
}
Below, the variables used to retrieve our data are stored.
url_Cleaned_DS <- "https://cuny-my.sharepoint.com/:x:/g/personal/gabriel_campos77_qmail_cuny_edu/EdcuUpO6t1RIqyHw-OP90UQB2qRYY3fcW5RPyA_xT348Qw?e=4HeXbe"
url_DS_job_market <- "https://cuny-my.sharepoint.com/:x:/g/personal/gabriel_campos77_qmail_cuny_edu/Eai1ECgMhthLmrut-eIdD5oBw4715IvolGgyG54owY0nWg?e=D0iRKm"
url_location <- "https://cuny-my.sharepoint.com/:x:/g/personal/gabriel_campos77_qmail_cuny_edu/EYAX_5NCc_hFtSUFY0W4NPMBEXiRYtgro4V3vrSJbpTcag?e=aRKrpf"
url_job_openings <- "https://cuny-my.sharepoint.com/:x:/g/personal/gabriel_campos77_qmail_cuny_edu/EWni58F8o9BKqLwvF-FDdIEBX7nGv3l7PeueVkrUEPhJ_A?e=9hOyGN"
csv_locations <- "job_locations.csv"
csv_job_openings <- "job_openings.csv"
csv_DS_job_market <- "alldata.csv"
csv_Cleaned_DS <- "Cleaned_DS_Jobs.csv"
Calling the OD_DL_csv() function, the CSV files are loaded into location_dim_df and job_opening_df, as sketched below.
filter() is used to select only the rows whose job title contains the word "Data".
# filter the 30,000 rows in the Rmd to lighten the load on MySQL
job_opening_df <- job_opening_df %>%
  filter(grepl('Data', job_title))
location_dim_df <- location_dim_df %>%
  filter(uniq_id %in% job_opening_df$uniq_id)
gsub(), along with a targeted regular expression, converts the HTML formatting in the job descriptions to plain text.
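The cleanup call itself is not reproduced here; the following is a minimal sketch, assuming the goal is to strip HTML tags from the description text (the team's actual pattern may differ):

```r
# remove HTML tags, then collapse leftover runs of whitespace
job_opening_df$job_descr <- gsub("<[^>]+>", " ", job_opening_df$job_descr)
job_opening_df$job_descr <- gsub("\\s+", " ", job_opening_df$job_descr)
```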
By taking the existing row names and adding 40,000 to them, a uniq_id column is created whose values fall outside the range of our other data frames, ensuring there are no collisions.
# add a unique ID column based on the row name of each entry
DS_job_df <- cbind(data.frame("uniq_id" = as.integer(rownames(DS_job_df))), DS_job_df)
# add 40K to each uniq_id to ensure none match the first 30K entries of the other sets
DS_job_df <- DS_job_df %>%
  mutate_at(vars("uniq_id"),
            funs(. + 40000))
# split the location column to create state, city, and zip_code
temp <- as.data.frame(str_split_fixed(DS_job_df$location, ",", 2)) # V1 = city, V2 = "state zip"
temp$V2 <- str_trim(temp$V2)
temp$V2 <- str_replace_all(temp$V2, "\\s", " == ") # mark the state/zip boundary for the next split
DS_job_df <- (temp <- cbind(DS_job_df, "city" = temp$V1,
                            as.data.frame(str_split_fixed(temp$V2, "==", 2))) %>%
                dplyr::rename("state" = V1, "zip_code" = V2))
# load the country column with a constant "United States" value
DS_job_df$country<-"United States"
# quick rearrange
DS_job_df <- DS_job_df %>%
  select(uniq_id, position, description, location, city, state, country, zip_code, company, reviews)
Using the dbWriteTable() function, we write these imported data frames into the appropriate tables in our local database.
The database consists of two tables in second normal form (2NF), as shown below.
ER Diagram
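The repo's DDL is not reproduced in this report; the following is a hypothetical schema consistent with the column listings earlier (the types and key constraints are assumptions):

```r
# hypothetical CREATE TABLE statements; the actual column types may differ
dbExecute(db_conn, "
  CREATE TABLE job_opening_tbl (
    uniq_id   INT PRIMARY KEY,
    job_title VARCHAR(255),
    job_descr TEXT
  )")
dbExecute(db_conn, "
  CREATE TABLE location_dim (
    uniq_id  INT,
    location VARCHAR(255),
    city     VARCHAR(100),
    country  VARCHAR(100),
    state    VARCHAR(50),
    zip_code VARCHAR(20),
    FOREIGN KEY (uniq_id) REFERENCES job_opening_tbl(uniq_id)
  )")
```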
dbWriteTable(db_conn, name = "location_dim", value = location_dim_df, append = TRUE, row.names = FALSE)
dbWriteTable(db_conn, name = "job_opening_tbl", value = job_opening_df, append = TRUE, row.names = FALSE)
A SQL query stored in the Project 3 repo is used to retrieve the merged data, filtered to job titles containing the word "Data".
## chr "SELECT j.uniq_id, j.job_title, j.job_descr, l.location FROM jobs.job_opening_tbl j, jobs.location_dim l where j"| __truncated__
## [1] "data.frame"
phrases <- result_set$job_descr
#view(phrases)
# createNgram: count the n-grams of a given size in a character vector
# (uses textcnt() from the tau package and returns a data.table)
createNgram <- function(stringVector, ngramSize){
  ngram <- data.table()
  ng <- textcnt(stringVector, method = "string", n = ngramSize, tolower = FALSE)
  if(ngramSize == 1){
    ngram <- data.table(w1 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  else {
    ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  return(ngram)
}
## [1] "w1" "freq" "length"
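The unigram table res used below is presumably built in a chunk that is not shown; under that assumption, the call would be:

```r
# unigram counts over all job-description text
res <- createNgram(phrases, 1)
names(res) # "w1" "freq" "length", matching the output above
```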
head(res %>% arrange(desc(freq)), 20) %>%
  knitr::kable(caption = "Projection of Word Frequency") %>%
  kableExtra::kable_styling(bootstrap_options = "striped")
| w1 | freq | length |
|---|---|---|
| and | 18118 | 3 |
| to | 9548 | 2 |
| of | 7340 | 2 |
| the | 7333 | 3 |
| a | 5924 | 1 |
| in | 5570 | 2 |
| data | 5557 | 4 |
| with | 4827 | 4 |
| for | 3371 | 3 |
| or | 3350 | 2 |
| is | 2302 | 2 |
| our | 1986 | 3 |
| as | 1882 | 2 |
| experience | 1746 | 10 |
| on | 1740 | 2 |
| business | 1646 | 8 |
| will | 1641 | 4 |
| that | 1449 | 4 |
| team | 1448 | 4 |
| are | 1426 | 3 |
Plotting zip codes
top_states <- location_dim_df$zip_code
zip.plot(location_dim_df, zip.file = system.file("extdata", "zips.tab", package =
  "muRL"), map.type = "usa", cex = .2, col = "orange", pch = 2,
  jitter.factor = NULL)
Using the N-gram function, which can extract single-word, two-word, or longer combinations, we extracted the most common hard, technical, and soft skills that we know from prior experience and from the articles linked below. Below are the top 20 hard and technical skills in order of frequency.
The Most In-Demand Tech Skills for Data Scientists
Data Scientist Resume Sample Template & Data-Driven Guide
We used the N-gram function to extract single-word skills, and then again to extract two-word skills such as "quantitative analysis." In conjunction with the N-gram function, we leveraged dplyr functions to tidy our data, ggplot for the bar graphs, and kableExtra for the styled tables of our data.
# hard skills:
#https://towardsdatascience.com/the-most-in-demand-tech-skills-for-data-scientists-d716d10c191d
# technical, hard and soft skills:
#https://zety.com/blog/data-scientist-resume-example
# job_phrases_1$w1 <- tolower(job_phrases_1$w1) sets all words to lowercase
job_descr_phrases <- result_set$job_descr
job_descr_text <- job_descr_phrases
job_phrases_1 <- createNgram(job_descr_text,1)
job_phrases_1$w1 <- tolower(job_phrases_1$w1) # convert to lowercase for accurate counting
job_phrases_2 <- createNgram(job_descr_text,2)
job_phrases_2$w1w2 <- tolower(job_phrases_2$w1w2)
# hard and tech skills single word skills
techskill = c("python", "r", "sql", "spark", "hadoop", "java",
              "tableau", "aws", "sas", "hive", "tensorflow", "scala",
              "c++", "excel", "azure", "mathematics", "statistics",
              "programming", "debugging", "probability", "modeling",
              "matplotlib", "openrefine", "matlab", "bigml", "d3.js",
              "ggplot2", "jupyter", "nltk", "scikit-learn",
              "weka", "predictive", "ai", "c", "linux", "nosql",
              "basic", "bayesian", "mapreduce")
hardskill= c("data visualization", "quantitative analysis", "data analysis",
"artificial intelligence", "predictive analysis", "predictive modeling",
"bachelors degree", "masters degree")job_phrases_1<-
job_phrases_1 %>%
select(-3) %>%
filter(w1 %in% techskill) %>%
group_by(w1) %>%
summarise_all(funs(sum)) %>%
arrange(desc(freq))
colnames(job_phrases_1)[1] <- "Hard and Tech Skills"
# hard and tech two-word skills
job_phrases_2<-
job_phrases_2 %>%
select(-3) %>%
filter(w1w2 %in% hardskill) %>%
group_by(w1w2) %>%
summarise_all(funs(sum))%>%
arrange(desc(freq))
colnames(job_phrases_2)[1] <- "Hard and Tech Skills"
# combine the single-word and two-word skills into one data frame
totaltech <-
rbind(job_phrases_1, job_phrases_2) %>%
arrange(desc(freq))
head(totaltech)%>%
knitr::kable(caption = "Tech Skills for Data Science ")%>%
kableExtra::kable_styling(bootstrap_options = "striped")
| Hard and Tech Skills | freq |
|---|---|
| python | 531 |
| sql | 481 |
| modeling | 432 |
| r | 426 |
| programming | 284 |
| spark | 245 |
#bar graph top 20 hard and tech skills
totaltech %>%
top_n(20) %>%
ggplot(aes(fct_reorder(`Hard and Tech Skills`,`freq`), `freq`))+
geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
coord_flip() +
xlab("") +
ylab("Frequency")+
ggtitle("Top 20 Hard and Tech Skills")## Selecting by freq
Note: for soft skills we extracted words similar to one another, such as lead, leadership, and leader, and then manually grouped them together since they all suggest the same skill. The soft skills were extracted and organized in the same fashion as the hard and technical skills above.
#soft skills single word
job_phrases_3 <- createNgram(job_descr_text,1)
job_phrases_3$w1 <- tolower(job_phrases_3$w1)
softskill = c("communication", "collaboration", "teamwork", "collaborate",
              "professional", "veteran", "lead", "leadership", "leader",
              "innovation", "innovate", "innovative", "collaborative",
              "passionate", "creative", "motivated", "integrity", "effectiveness",
              "pioneering", "perceptiveness", "perceptive")
job_phrases_4 <- createNgram(job_descr_text,2)
job_phrases_4$w1w2 <- tolower(job_phrases_4$w1w2)
softskill2 = c("critical thinking", "problem solving","interpersonal skills",
"time management")job_phrases_3<-
job_phrases_3 %>%
select(-3) %>%
filter(w1 %in% softskill) %>%
group_by(w1) %>%
summarise_all(funs(sum)) %>%
arrange(desc(freq))
colnames(job_phrases_3)[1] <- "Soft Skills"
# soft skills, two-word
job_phrases_4<- job_phrases_4 %>%
select(-3) %>%
filter(w1w2 %in% softskill2) %>%
group_by(w1w2) %>%
summarise_all(funs(sum)) %>%
arrange(desc(freq))
colnames(job_phrases_4)[1] <- "Soft Skills"
# combine the single-word and two-word skills into one data frame
totalsoft <-
rbind(job_phrases_3, job_phrases_4) %>%
arrange(desc(freq)) #arranges in descending order
# manually merge rows that are variants of the same skill (e.g., lead/leader/leadership);
# the row indices refer to positions in the frequency-sorted table
totalsoft <- totalsoft[-c(2,10), ]
totalsoft[7,2] = (302+158+141) # fold the removed variants' counts into the surviving row
totalsoft <- totalsoft[-c(8,15), ]
totalsoft[4,2] = (264+149+53)
totalsoft <- totalsoft[-c(12,10), ]
totalsoft[5,2] = (203+106+91)
totalsoft<- arrange(totalsoft,desc(freq))
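The hard-coded row indices above are fragile if the input ordering ever changes. As an order-independent alternative, the freshly combined table from the rbind step could be regrouped with dplyr::recode(); this is a sketch only, and the variant-to-canonical mapping is inferred from the skill list rather than taken from the repo:

```r
# map each variant onto one canonical skill, then re-aggregate the counts
totalsoft_grouped <- totalsoft %>%
  mutate(`Soft Skills` = dplyr::recode(`Soft Skills`,
                                       lead = "leadership", leader = "leadership",
                                       collaboration = "collaborate", collaborative = "collaborate",
                                       innovation = "innovative", innovate = "innovative")) %>%
  group_by(`Soft Skills`) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))
```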
head(totalsoft)%>%
knitr::kable(caption = "Soft Skills for Data Science Jobs") %>%
kableExtra::kable_styling(bootstrap_options = "striped")
| Soft Skills | freq |
|---|---|
| leadership | 601 |
| collaborate | 466 |
| professional | 400 |
| communication | 356 |
| veteran | 265 |
| innovative | 264 |
#bar graph top 10 soft skills
totalsoft %>%
top_n(10) %>% #extract top 10
ggplot(aes(fct_reorder(`Soft Skills`,`freq`), `freq`))+
geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
coord_flip() +
xlab("") +
ylab("Frequency")+
ggtitle("Top Ten Soft Skills")## Selecting by freq
#calc the percentage of hard and tech skills
totaltech_freq_pct <- round((totaltech$freq/nrow(result_set))*100, 2)
(totaltech <- cbind(totaltech, totaltech_freq_pct))%>%
knitr::kable(caption = "Percentage of Hard and Tech Skills ")%>%
kableExtra::kable_styling(bootstrap_options = "striped")
| Hard and Tech Skills | freq | totaltech_freq_pct |
|---|---|---|
| python | 531 | 79.49 |
| sql | 481 | 72.01 |
| modeling | 432 | 64.67 |
| r | 426 | 63.77 |
| programming | 284 | 42.51 |
| spark | 245 | 36.68 |
| mathematics | 231 | 34.58 |
| ai | 228 | 34.13 |
| predictive | 228 | 34.13 |
| hadoop | 222 | 33.23 |
| data analysis | 204 | 30.54 |
| java | 196 | 29.34 |
| c | 172 | 25.75 |
| hive | 163 | 24.40 |
| data visualization | 132 | 19.76 |
| tableau | 124 | 18.56 |
| sas | 122 | 18.26 |
| scala | 105 | 15.72 |
| azure | 103 | 15.42 |
| basic | 95 | 14.22 |
| excel | 91 | 13.62 |
| linux | 86 | 12.87 |
| artificial intelligence | 83 | 12.43 |
| aws | 78 | 11.68 |
| predictive modeling | 71 | 10.63 |
| matlab | 70 | 10.48 |
| nosql | 66 | 9.88 |
| tensorflow | 61 | 9.13 |
| mapreduce | 60 | 8.98 |
| quantitative analysis | 29 | 4.34 |
| bayesian | 26 | 3.89 |
| probability | 20 | 2.99 |
| jupyter | 14 | 2.10 |
| masters degree | 12 | 1.80 |
| nltk | 8 | 1.20 |
| bachelors degree | 8 | 1.20 |
| debugging | 7 | 1.05 |
| predictive analysis | 6 | 0.90 |
| matplotlib | 5 | 0.75 |
| weka | 5 | 0.75 |
#bar graph of top 15 hard and tech skills as percentage
totaltech %>%
top_n(15) %>%
ggplot(aes(y=reorder(`Hard and Tech Skills`,freq),x=freq,fill=`Hard and Tech Skills`)) +
geom_bar(stat = 'identity',position=position_dodge()) +
geom_text(aes(label=totaltech_freq_pct), vjust=1.0, color="black",
position = position_dodge(0.9), size=3.0) +
labs(y = ("Hard and Tech Skills"),x = ("Hard and Tech Skill Count"),
title = ("Percentage of Hard/Technical Skills Found in Data Science Job Opps")) +
theme_minimal()
## Selecting by totaltech_freq_pct
totalsoft_freq_pct <- round((totalsoft$freq/nrow(result_set))*100, 2)
(totalsoft <- cbind(totalsoft, totalsoft_freq_pct))%>%
knitr::kable(caption = "Percentage of Soft Skills ")%>%
kableExtra::kable_styling(bootstrap_options = "striped")
| Soft Skills | freq | totalsoft_freq_pct |
|---|---|---|
| leadership | 601 | 89.97 |
| collaborate | 466 | 69.76 |
| professional | 400 | 59.88 |
| communication | 356 | 53.29 |
| veteran | 265 | 39.67 |
| innovative | 264 | 39.52 |
| passionate | 173 | 25.90 |
| problem solving | 135 | 20.21 |
| creative | 124 | 18.56 |
| collaboration | 91 | 13.62 |
| integrity | 79 | 11.83 |
| pioneering | 52 | 7.78 |
| effectiveness | 37 | 5.54 |
| interpersonal skills | 34 | 5.09 |
| teamwork | 25 | 3.74 |
| critical thinking | 20 | 2.99 |
| time management | 16 | 2.40 |
#bar graph of top 10 soft skills as percentage
totalsoft %>%
top_n(10) %>%
ggplot(aes(y=reorder(`Soft Skills`,freq),x=freq,fill=`Soft Skills`)) +
geom_bar(stat = 'identity',position=position_dodge()) +
geom_text(aes(label=totalsoft_freq_pct), vjust=1.0, color="black",
position = position_dodge(0.9), size=3.0) +
labs(y = ("Soft Skills"),x = ("Soft Skill Frequency"),
title = ("Percentage of Soft Skills Found in Data Science Job Opps")) +
theme_minimal()
## Selecting by totalsoft_freq_pct
The word clouds reflect that Python, SQL, and R are crucial skills for a data scientist to progress in this field. Among the soft skills, leadership, communication, and collaboration stand out. The treemap identifies leadership, collaboration, communication, and professionalism as important soft skills for excelling as a data scientist.
Hard and Tech Skills
set.seed(1234) # for reproducibility
wordcloud(words =totaltech$`Hard and Tech Skills` ,
freq = totaltech$freq, min.freq = 1,max.words=200,
random.order=FALSE,
rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Soft Skills
set.seed(1234) # for reproducibility
wordcloud(words =totalsoft$`Soft Skills` ,
freq = totalsoft$freq, min.freq = 1,max.words=200,
random.order=FALSE,
rot.per=0.35,
colors=brewer.pal(7, "Accent"))
Tech Skills
treemap(totaltech,
index=c("Hard and Tech Skills"),
vSize = "freq",
type="index",
palette ="Set1",
title="Desireable Data Scientist Tech Skills", #Customize your title
fontsize.title = 14 #Change the font size of the title
)
Soft Skills
treemap(totalsoft,
index=c("Soft Skills"),
vSize = "freq",
type="index",
palette ="Set2",
title="Desireable Data Scientist Soft Skills",
fontsize.title = 14
)
To conclude our findings, the charts in our analyses clearly show that some of the most valuable skills for data scientists are leadership, communication, SQL, statistics, and Python. We also see from the job locations that the largest concentration of data science jobs in the United States is in the Northeast, particularly near New York.
The foundation of this project is a relational database built in MySQL, in tandem with in-house R code that let us read and import data into R and save our work to GitHub seamlessly.
From there, we created an N-gram function that allowed us to analyze the text of the job descriptions. Finally, using that function, we analyzed the data to see which skills were most valued. Plotting the results onto graphs allowed us to understand and interpret them in different ways, while also being aesthetically pleasing.
From our experience working together on this project, we can confidently say that leadership and communication were the two most important soft skills. Thanks to these, we made the project a success despite varying skill levels and backgrounds, because we all shared what was on our minds and kept a clear sense of what to do next. As far as hard skills are concerned, the most important were designing and building the database, importing from the database into R, and performing the analyses.
This project was a pleasure, as we all learned from each other and expanded our skill sets simultaneously. Having seen it firsthand, we now understand that soft skills are just as important as hard skills: the soft skills are what allowed us to get the work done effectively, which is the most important goal of any project.