Project 3: Most Valued Data Science Skills

Assignment Requirements

This is a project for your entire class section to work on together, since being able to work effectively on a virtual team is a key “soft skill” for data scientists.

Please note especially the requirement about making a presentation during our first meetup after the project is due.

    W. Edwards Deming said, “In God we trust, all others must bring data.” 

Please use data to answer the question:

    “Which are the most valued data science skills?”

Consider your work as an exploration: there is not necessarily a “right answer.”

Grading rubric:
  • You will need to determine what tool(s) you’ll use as a group to effectively collaborate, share code and any project documentation (such as motivation, approach, findings).
  • You will have to determine what data to collect, where the data can be found, and how to load it.
  • The data that you decide to collect should reside in a relational database, in a set of normalized tables.
  • You should perform any needed tidying, transformations, and exploratory data analysis in R.
  • Your deliverable should include all code, results, and documentation of your motivation, approach, and findings.
  • As a group, you should appoint (at least) three people to lead parts of the presentation.
  • While you are strongly encouraged (and will hopefully find it fun) to try out statistics and data models, your grade will NOT be affected by the statistical analysis and modeling performed (since this is a semester one course on Data Acquisition and Management).
  • Every student must be prepared to explain how the data was collected, loaded, transformed, tidied, and analyzed for outliers, etc. in our Meetup.
    • This is the **only** way I’ll have to determine that everyone actively participated in the process, so you need to hold yourself responsible for understanding what your class-size team did!
  • If you are unable to attend the meetup, then you need to either present to me one-on-one before the meetup presentation, or post a 3-to-5-minute video (e.g., on YouTube) explaining the process.
    • Individual students will not be responsible for explaining any forays into statistical analysis, modeling, data mining, regression, decision trees, etc.
    • You are encouraged to start early, ask many questions, actively post on the provided discussion forum, etc.

Database Creation, Import & Export

Connect to Local Database

A database is created on each team member’s local machine using MySQL. Using an Open Database Connectivity (ODBC) driver, a connection is established through a Data Source Name (DSN) that links R to a database user with SELECT and INSERT privileges. All files are available in the shared GitHub repository DATA607_CUNY_2021_Project3.
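For readers without a preconfigured DSN, a minimal sketch of an equivalent explicit connection is shown below; the driver name, server, and credentials are assumptions and should be replaced with your own.

library(DBI)
library(odbc)

# hypothetical explicit connection; the report itself uses the "data607" DSN
db_conn_alt <- dbConnect(
   odbc::odbc(),
   Driver   = "MySQL ODBC 8.0 Unicode Driver", # assumed driver name
   Server   = "localhost",
   Database = "jobs",                          # schema referenced by the SQL later
   UID      = "data607_user",                  # hypothetical user with SELECT/INSERT
   PWD      = Sys.getenv("MYSQL_PWD"),         # avoid hard-coding passwords
   Port     = 3306
)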

db_conn <- dbConnect(odbc(), "data607")
dbListTables(db_conn)
## [1] "job_opening_tbl" "location_dim"
# table location_dim fields are listed below
dbListFields(db_conn, "location_dim")
## [1] "uniq_id"  "location" "city"     "country"  "state"    "zip_code"
dbListFields(db_conn, "job_opening_tbl")
## [1] "uniq_id"   "job_title" "job_descr"

Load CSV

The function OD_DL_csv is created to access the necessary CSV files, which contain over 30,000 observations. This method was selected over loading directly from GitHub due to storage limitations with files this large. The shared-file URLs and CSV file names are stored in the variables url_location, url_job_openings, csv_locations, and csv_job_openings, respectively (plus equivalents for the two additional datasets).

OD_DL_csv takes the shared URL, splits it at the query string, unlists the pieces, and pastes a download parameter back on to build a direct-download link. From there, curl_download() from the curl package completes the download.

OD_DL_csv <- function(sharedURL, file_name, save2wd = FALSE){

   # Strip the query string from the shared URL
   URL1 <- unlist(strsplit(sharedURL, "[?]"))[1]
   URL1 <- paste0(URL1, "?download=1") # edit URL to make it a downloadable link

   # Download the file to a temp directory using the supplied file name
   curl::curl_download(
      URL1,
      destfile = file.path(tempdir(), file_name),
      mode = "wb"
      )

   # If the user wants it saved to their working directory, copy the file
   # (file.path() keeps the paths portable across operating systems)
   if(isTRUE(save2wd)){
      file.copy(
         from = file.path(tempdir(), file_name),
         to = "./")
      }

   # Return the CSV as a data.frame
   return(read.csv(file.path(tempdir(), file_name), stringsAsFactors = FALSE))

}

The variables used to retrieve our data are stored below.

url_Cleaned_DS <- "https://cuny-my.sharepoint.com/:x:/g/personal/gabriel_campos77_qmail_cuny_edu/EdcuUpO6t1RIqyHw-OP90UQB2qRYY3fcW5RPyA_xT348Qw?e=4HeXbe"

url_DS_job_market<- "https://cuny-my.sharepoint.com/:x:/g/personal/gabriel_campos77_qmail_cuny_edu/Eai1ECgMhthLmrut-eIdD5oBw4715IvolGgyG54owY0nWg?e=D0iRKm"

url_location<-
   "https://cuny-my.sharepoint.com/:x:/g/personal/gabriel_campos77_qmail_cuny_edu/EYAX_5NCc_hFtSUFY0W4NPMBEXiRYtgro4V3vrSJbpTcag?e=aRKrpf"

url_job_openings<-
   "https://cuny-my.sharepoint.com/:x:/g/personal/gabriel_campos77_qmail_cuny_edu/EWni58F8o9BKqLwvF-FDdIEBX7nGv3l7PeueVkrUEPhJ_A?e=9hOyGN"

csv_locations<-
   "job_locations.csv"

csv_job_openings<-
   "job_openings.csv"

csv_DS_job_market <-
  "alldata.csv"

csv_Cleaned_DS<- 
  "Cleaned_DS_Jobs.csv"

Calling the OD_DL_csv() function, the CSV files are loaded into location_dim_df, job_opening_df, and DS_job_df.

location_dim_df <- OD_DL_csv(sharedURL = url_location,
          file_name = csv_locations
          )
job_opening_df <- OD_DL_csv(sharedURL = url_job_openings,
          file_name = csv_job_openings
          )
# additional untidy data set
DS_job_df <- OD_DL_csv(sharedURL = url_DS_job_market,
          file_name = csv_DS_job_market
          )

Manipulating Imported CSV Dataset

filter() is used to keep only the rows whose job title contains the word Data.

# filter the ~30,000 rows in the Rmd to lighten the load on MySQL
job_opening_df <- job_opening_df %>%
  filter(grepl('Data', job_title))
# keep only the locations that match the filtered job openings
location_dim_df <- location_dim_df %>%
  filter(uniq_id %in% job_opening_df$uniq_id)

gsub() with a regular expression strips HTML tags from the job descriptions, reducing them to plain text.

job_opening_df$job_descr <- gsub("<[^>]+>","",job_opening_df$job_descr)
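As a quick illustration of the tag-stripping regex on a made-up snippet:

html_snippet <- "<p>Experience with <b>Python</b> and SQL</p>" # hypothetical input
gsub("<[^>]+>", "", html_snippet)
## [1] "Experience with Python and SQL"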

By taking the existing row names and adding 40,000 to them, the uniq_id column is created outside the ID range of our previous data frames, ensuring there are no collisions.

# add unique ID column based on row name of each entry
DS_job_df<-cbind( data.frame("uniq_id" = as.integer(rownames(DS_job_df))), DS_job_df)

# add 40K to each uniq_id to ensure none match the first 30K entries of the other sets
DS_job_df <- DS_job_df %>%
  mutate(uniq_id = uniq_id + 40000)
# split the location column into city, state, and zip_code
temp <- as.data.frame(str_split_fixed(DS_job_df$location, ",", 2))
temp$V2 <- str_trim(temp$V2)
temp$V2 <- str_replace_all(temp$V2, "\\s", " == ") # mark the state/zip boundary
DS_job_df <- (temp <- cbind(DS_job_df, "city" = temp$V1,
              as.data.frame(str_split_fixed(temp$V2, "==", 2))) %>%
                dplyr::rename("state" = V1, "zip_code" = V2))
# load the country column with a constant "United States" value
DS_job_df$country <- "United States"
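To see what the splitting logic does, here is a trace on a single hypothetical location string:

loc <- "San Francisco, CA 94105"                       # made-up example
parts <- str_split_fixed(loc, ",", 2)                  # "San Francisco" | " CA 94105"
state_zip <- str_trim(parts[, 2])                      # "CA 94105"
state_zip <- str_replace_all(state_zip, "\\s", " == ") # "CA == 94105"
str_split_fixed(state_zip, "==", 2)                    # "CA " | " 94105"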

# quick rearrange
DS_job_df <- DS_job_df %>%
  select(uniq_id, position, description, location, city, state, country, zip_code,
         company, reviews)
# stack the new rows onto the existing frames (rbind matches columns by name)
job_opening_df <- rbind(DS_job_df %>%
                          select(uniq_id, position, description) %>%
                          dplyr::rename("job_title" = position,
                                        "job_descr" = description),
                        job_opening_df)
location_dim_df <- rbind(DS_job_df %>%
                           select(uniq_id, location, city, state, country, zip_code),
                         location_dim_df)

Write to Database

Using the dbWriteTable() function, we write these data frames into the corresponding tables in our local database.

The database consists of two tables in Second Normal Form (2NF), as shown in the ER diagram below.

ER Diagram
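For reference, here is a minimal DDL sketch matching the fields listed by dbListFields() earlier. The column types and key constraints are assumptions; the authoritative definitions live in the SQL script in the project repo.

# hypothetical table definitions; adjust types to match the repo's SQL script
dbExecute(db_conn, "
  CREATE TABLE IF NOT EXISTS job_opening_tbl (
    uniq_id   INT PRIMARY KEY,
    job_title VARCHAR(255),
    job_descr TEXT
  )")
dbExecute(db_conn, "
  CREATE TABLE IF NOT EXISTS location_dim (
    uniq_id  INT PRIMARY KEY,
    location VARCHAR(255),
    city     VARCHAR(100),
    country  VARCHAR(100),
    state    VARCHAR(50),
    zip_code VARCHAR(10)
  )")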

dbWriteTable(db_conn, name = "location_dim", value = location_dim_df, append = TRUE,row.names = FALSE)
dbWriteTable(db_conn, name = "job_opening_tbl", value = job_opening_df, append = TRUE, row.names = FALSE)

A SQL script in the Project 3 repo is used to retrieve the merged data, filtered to job titles containing the word Data.

db_sql <- read_file(filename)     # read_file() from readr; filename points to the SQL script in the repo
db_sql <- gsub("\n", " ", db_sql) # flatten newlines so the query is a single string
str(db_sql)
##  chr "SELECT j.uniq_id, j.job_title, j.job_descr, l.location FROM jobs.job_opening_tbl j, jobs.location_dim l where j"| __truncated__
result_set = dbGetQuery(db_conn, db_sql)
class(result_set)
## [1] "data.frame"

Analysis

N-Gram function

phrases <- result_set$job_descr
#view(phrases)

createNgram <- function(stringVector, ngramSize){

  # data.table holds the result; textcnt() from the tau package counts
  # word n-grams when method = "string"
  ngram <- data.table()

  ng <- textcnt(stringVector, method = "string", n = ngramSize, tolower = FALSE)

  if(ngramSize == 1){
    ngram <- data.table(w1 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  else {
    ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  return(ngram)
}
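As a quick sanity check of the helper on a made-up sentence (expected output noted in the comment):

# expect the bigrams "data science", "science and", "and data", and
# "data analysis", each with freq 1
createNgram("data science and data analysis", 2)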
text <- phrases
res <- createNgram(text,1)
names(res)
## [1] "w1"     "freq"   "length"
head(res %>% arrange(desc(freq)),20)%>%
  knitr::kable(caption = "Projection of Word Frequency ")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")
## Warning in kableExtra::kable_styling(., bootstrap_options = "striped"): Please
## specify format in kable. kableExtra can customize either HTML or LaTeX outputs.
## See https://haozhu233.github.io/kableExtra/ for details.
Projection of Word Frequency
w1 freq length
and 18118 3
to 9548 2
of 7340 2
the 7333 3
a 5924 1
in 5570 2
data 5557 4
with 4827 4
for 3371 3
or 3350 2
is 2302 2
our 1986 3
as 1882 2
experience 1746 10
on 1740 2
business 1646 8
will 1641 4
that 1449 4
team 1448 4
are 1426 3

Plotting zip codes

top_states <- location_dim_df$zip_code

# zip.plot() from the muRL package maps the zip codes in location_dim_df onto a US map
zip.plot(location_dim_df,
         zip.file = system.file("extdata", "zips.tab", package = "muRL"),
         map.type = "usa", cex = 1, col = "orange", pch = 10,
         jitter.factor = NULL)

Frequency Count

Hard and Tech Skills

Using the N-gram function, which can extract single-word, two-word, or longer combinations, we extracted the most common hard, technical, and soft skills known from prior experience and from the articles linked below. The top 20 hard and technical skills are listed below in order of frequency.
The Most In-Demand Tech Skills for Data Scientists
Data Scientist Resume Sample Template & Data-Driven Guide
We used the N-gram function once to extract single-word skills, and again to extract two-word skills such as “quantitative analysis.” Alongside the N-gram function, we leveraged dplyr functions to tidy our data, ggplot for the bar graphs, and reactable for the table view of our data.

# hard skills: 
#https://towardsdatascience.com/the-most-in-demand-tech-skills-for-data-scientists-d716d10c191d
# technical, hard and soft skills:
#https://zety.com/blog/data-scientist-resume-example
# tolower() sets all extracted words to lowercase for accurate counting
job_descr_phrases  <- result_set$job_descr
job_descr_text     <- job_descr_phrases
job_phrases_1      <- createNgram(job_descr_text,1)
job_phrases_1$w1   <- tolower(job_phrases_1$w1) # convert to lowercase for accurate counting
job_phrases_2      <- createNgram(job_descr_text,2)
job_phrases_2$w1w2 <- tolower(job_phrases_2$w1w2)
# hard and tech skills single word skills
techskill = c("python", "r", "sql", "spark", "hadoop", "java",
              "tableau", "aws", "sas", "hive", "tensorflow", "scala",
              "c++", "excel", "azure", "mathematics", "statistics",
              "programming", "debugging", "probability", "modeling",
              "matplotlib", "openrefine", "matlab", "bigml", "d3.js",
              "ggplot2", "jupyter", "nltk", "scikit-learn",
              "weka", "predictive", "ai", "c", "linux", "nosql",
              "basic", "bayesian", "mapreduce")

hardskill= c("data visualization", "quantitative analysis", "data analysis",
             "artificial intelligence", "predictive analysis", "predictive modeling",
             "bachelors degree", "masters degree")
job_phrases_1 <-
  job_phrases_1 %>%
  select(-3) %>%
  filter(w1 %in% techskill) %>%
  group_by(w1) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))

colnames(job_phrases_1)[1] <- "Hard and Tech Skills"
#hard and tech double word skills
job_phrases_2 <-
  job_phrases_2 %>%
  select(-3) %>%
  filter(w1w2 %in% hardskill) %>%
  group_by(w1w2) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))

colnames(job_phrases_2)[1] <- "Hard and Tech Skills"
#combine the single word and two word skills into one dataframe
totaltech <-
  rbind(job_phrases_1, job_phrases_2) %>%
  arrange(desc(freq))

head(totaltech)%>%
  knitr::kable(caption = "Tech Skills for Data Science ")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")
Tech Skills for Data Science
Hard and Tech Skills freq
python 531
sql 481
modeling 432
r 426
programming 284
spark 245
#bar graph top 20 hard and tech skills
totaltech %>% 
  top_n(20) %>% 
    ggplot(aes(fct_reorder(`Hard and Tech Skills`,`freq`), `freq`))+
      geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
      coord_flip() +
        xlab("") +
        ylab("Frequency")+
        ggtitle("Top 20 Hard and Tech Skills")
## Selecting by freq

reactable(totaltech, bordered = TRUE, striped = TRUE,
          highlight = TRUE, filterable = TRUE,  showPageSizeOptions = TRUE,
          showPagination = TRUE, pageSizeOptions = c(5, 10, 20),
          defaultPageSize = 10)
Soft Skills

Note: for soft skills we extracted words similar to one another, such as lead, leadership, and leader, and then manually grouped them together since they are all suggestive of the same skill (a more reproducible alternative is sketched after the merging code below). The soft skills were extracted and organized in a similar fashion to the hard and technical skills above.

#soft skills single word
job_phrases_3 <- createNgram(job_descr_text,1)
job_phrases_3$w1 <- tolower(job_phrases_3$w1) 
softskill = c("communication", "collaboration", "teamwork", "collaborate",
              "professional", "veteran", "lead", "leadership", "leader",
              "innovation", "innovate", "innovative", "collaborative",
              "passionate", "creative", "motivated", "integrity", "effectiveness",
              "pioneering", "perceptiveness", "perceptive")
job_phrases_4 <- createNgram(job_descr_text,2)
job_phrases_4$w1w2 <- tolower(job_phrases_4$w1w2)
softskill2 = c("critical thinking", "problem solving","interpersonal skills",
               "time management")
job_phrases_3 <-
  job_phrases_3 %>%
  select(-3) %>%
  filter(w1 %in% softskill) %>%
  group_by(w1) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))

colnames(job_phrases_3)[1] <- "Soft Skills"
#soft skills double word
job_phrases_4 <- job_phrases_4 %>%
  select(-3) %>%
  filter(w1w2 %in% softskill2) %>%
  group_by(w1w2) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))
colnames(job_phrases_4)[1] <- "Soft Skills"
#combine the single word and two word skills into one dataframe
totalsoft <-
  rbind(job_phrases_3, job_phrases_4) %>%
  arrange(desc(freq)) # arrange in descending order

# Manually merge rows whose terms express the same skill (e.g., lead, leader,
# leadership): drop the variant rows by index and overwrite one row's freq with
# the summed counts. These indices are specific to this result set and will
# break if the data changes.
totalsoft <- totalsoft[-c(2, 10), ]
totalsoft[7, 2] = (302 + 158 + 141)

totalsoft <- totalsoft[-c(8, 15), ]
totalsoft[4, 2] = (264 + 149 + 53)

totalsoft <- totalsoft[-c(12, 10), ]
totalsoft[5, 2] = (203 + 106 + 91)

totalsoft <- arrange(totalsoft, desc(freq))
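A more reproducible alternative to the index-based merging above is to recode variant terms to a canonical label and re-aggregate. This is a sketch only: the variant-to-canonical mapping below is an assumption about which terms were grouped, so adjust it to match the intended groupings.

# sketch: map variants to one canonical label, then re-sum (mapping is assumed)
totalsoft_alt <- rbind(job_phrases_3, job_phrases_4) %>%
  mutate(`Soft Skills` = dplyr::recode(`Soft Skills`,
                                       lead = "leadership",
                                       leader = "leadership",
                                       innovate = "innovative",
                                       innovation = "innovative",
                                       collaborative = "collaborate")) %>%
  group_by(`Soft Skills`) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))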

head(totalsoft) %>%
  knitr::kable(caption = "Soft Skills for Data Science Jobs") %>%
  kableExtra::kable_styling(bootstrap_options = "striped")
Soft Skills for Data Science Jobs
Soft Skills freq
leadership 601
collaborate 466
professional 400
communication 356
veteran 265
innovative 264
#bar graph top 10 soft skills
totalsoft %>% 
  top_n(10) %>%  #extract top 10
    ggplot(aes(fct_reorder(`Soft Skills`,`freq`), `freq`))+
      geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
      coord_flip() +
      xlab("") +
      ylab("Frequency")+
      ggtitle("Top Ten Soft Skills")
## Selecting by freq

reactable(totalsoft, bordered = TRUE, striped = TRUE, highlight = TRUE,
          filterable = TRUE, showPageSizeOptions = TRUE, showPagination = TRUE,
          pageSizeOptions = c(5, 10), defaultPageSize = 5)

Percentages

Each percentage below is a skill’s total mention count divided by the number of job postings in the result set, times 100; a skill mentioned more than once in a single posting is therefore counted each time.

#calc the percentage of hard and tech skills

totaltech_freq_pct <- round((totaltech$freq/nrow(result_set))*100, 2)
(totaltech <- cbind(totaltech, totaltech_freq_pct))%>%
  knitr::kable(caption = "Percentage of Hard and Tech Skills ")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")
Percentage of Hard and Tech Skills
Hard and Tech Skills freq totaltech_freq_pct
python 531 79.49
sql 481 72.01
modeling 432 64.67
r 426 63.77
programming 284 42.51
spark 245 36.68
mathematics 231 34.58
ai 228 34.13
predictive 228 34.13
hadoop 222 33.23
data analysis 204 30.54
java 196 29.34
c 172 25.75
hive 163 24.40
data visualization 132 19.76
tableau 124 18.56
sas 122 18.26
scala 105 15.72
azure 103 15.42
basic 95 14.22
excel 91 13.62
linux 86 12.87
artificial intelligence 83 12.43
aws 78 11.68
predictive modeling 71 10.63
matlab 70 10.48
nosql 66 9.88
tensorflow 61 9.13
mapreduce 60 8.98
quantitative analysis 29 4.34
bayesian 26 3.89
probability 20 2.99
jupyter 14 2.10
masters degree 12 1.80
nltk 8 1.20
bachelors degree 8 1.20
debugging 7 1.05
predictive analysis 6 0.90
matplotlib 5 0.75
weka 5 0.75
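The percentages can be sanity-checked by hand. The result-set size is not printed in this section, but the table implies roughly 668 postings (531 / 0.7949 ≈ 668), so treat that figure as an inference.

# python: 531 mentions over ~668 postings (inferred) reproduces the table's 79.49
round(531 / 668 * 100, 2)
## [1] 79.49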

#bar graph of top 15 hard and tech skills as percentage

totaltech %>%
  top_n(15) %>%
    ggplot(aes(y=reorder(`Hard and Tech Skills`,freq),x=freq,fill=`Hard and Tech Skills`)) +
      geom_bar(stat = 'identity',position=position_dodge()) +
      geom_text(aes(label=totaltech_freq_pct), vjust=1.0, color="black",
                position = position_dodge(0.9), size=3.0) +
      labs(y = ("Hard and Tech Skills"),x = ("Hard and Tech Skill Count"),
           title = ("Percentage of Hard/Technical Skills Found in Data Science Job Opps")) +
      theme_minimal()
## Selecting by totaltech_freq_pct

totalsoft_freq_pct <- round((totalsoft$freq/nrow(result_set))*100, 2)

(totalsoft <- cbind(totalsoft, totalsoft_freq_pct))%>%
  knitr::kable(caption = "Percentage of Soft Skills ")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")
Percentage of Soft Skills
Soft Skills freq totalsoft_freq_pct
leadership 601 89.97
collaborate 466 69.76
professional 400 59.88
communication 356 53.29
veteran 265 39.67
innovative 264 39.52
passionate 173 25.90
problem solving 135 20.21
creative 124 18.56
collaboration 91 13.62
integrity 79 11.83
pioneering 52 7.78
effectiveness 37 5.54
interpersonal skills 34 5.09
teamwork 25 3.74
critical thinking 20 2.99
time management 16 2.40

#bar graph of top 10 soft skills as percentage

totalsoft %>%
  top_n(10) %>%
    ggplot(aes(y=reorder(`Soft Skills`,freq),x=freq,fill=`Soft Skills`)) +
      geom_bar(stat = 'identity',position=position_dodge()) +
      geom_text(aes(label=totalsoft_freq_pct), vjust=1.0, color="black", 
            position = position_dodge(0.9), size=3.0) +
      labs(y = ("Soft Skills"),x = ("Soft Skill Frequency"),
           title = ("Percentage of Soft Skills Found in Data Science Job Opps")) +
      theme_minimal()
## Selecting by totalsoft_freq_pct

Word Cloud and TreeMap

The word cloud reflects that Python, SQL, and R are crucial skills for a data scientist to progress in this field. Among soft skills, leadership, communication, and collaboration appear prominent. The tree map likewise identifies leadership, collaboration, communication, and professionalism as important soft skills for excelling as a data scientist.
set.seed(1234) # for reproducibility 
wordcloud(words =totaltech$`Hard and Tech Skills` ,
          freq = totaltech$freq, min.freq = 1,max.words=200,
          random.order=FALSE,
          rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))

set.seed(1234) # for reproducibility 
wordcloud(words =totalsoft$`Soft Skills` ,
          freq = totalsoft$freq, min.freq = 1,max.words=200,
          random.order=FALSE,
          rot.per=0.35,
          colors=brewer.pal(7, "Accent"))

TreeMap
Tech Skills
treemap(totaltech,
        index = c("Hard and Tech Skills"),
        vSize = "freq",
        type = "index",
        palette = "Set1",
        title = "Desirable Data Scientist Tech Skills",
        fontsize.title = 14
        )

Soft Skills
treemap(totalsoft,
        index = c("Soft Skills"),
        vSize = "freq",
        type = "index",
        palette = "Set2",
        title = "Desirable Data Scientist Soft Skills",
        fontsize.title = 14
        )

Conclusion

To conclude our findings, the charts in our analyses show that some of the most valued skills for data scientists are leadership, communication, Python, SQL, and R. We also see from the job locations that the largest concentration of data science jobs in the United States is in the Northeast, particularly near New York.

The foundation of this project is a relational database built in MySQL, in tandem with our own R functions that allowed us to read and import data into R and save our work to GitHub seamlessly.
From there, we created an N-gram function to analyze the text of the job descriptions. Finally, using the N-gram function, we measured which skills appear most often. Plotting the data onto graphs let us understand and interpret our results from several angles, while also being aesthetically pleasing.

From our experience working together on this project, we can confidently say that leadership and communication were the two most important soft skills. Thanks to these, we made this project a success despite varying skill levels and backgrounds, because we all shared what was on our minds with a clear sense of what to do next. As far as hard skills are concerned, the most important were designing and building the database, importing from the database into R, and performing the analyses.

This project was a pleasure to complete, as we all learned from each other and expanded our skill sets simultaneously. We now understand firsthand that soft skills are just as important as hard skills: the soft skills allowed us to get the work done effectively, which is the most important goal for any project.