Project 3: Most Valued Data Science Skills

Assignment Requirements

This is a project for your entire class section to work on together, since being able to work effectively on a virtual team is a key “soft skill” for data scientists.

Please note especially the requirement about making a presentation during our first meetup after the project is due.

    W. Edwards Deming said, “In God we trust, all others must bring data.” 

Please use data to answer the question:

    “Which are the most valued data science skills?”

Consider your work as an exploration: there is not necessarily a “right answer.”

Grading rubric:
  • You will need to determine what tool(s) you’ll use as a group to effectively collaborate, share code and any project documentation (such as motivation, approach, findings).
  • You will have to determine what data to collect, where the data can be found, and how to load it.
  • The data that you decide to collect should reside in a relational database, in a set of normalized tables.
  • You should perform any needed tidying, transformations, and exploratory data analysis in R.
  • Your deliverable should include all code, results, and documentation of your motivation, approach, and findings.
  • As a group, you should appoint (at least) three people to lead parts of the presentation.
  • While you are strongly encouraged (and will hopefully find it fun) to try out statistics and data models, your grade will NOT be affected by the statistical analysis and modeling performed (since this is a semester one course on Data Acquisition and Management).
  • Every student must be prepared to explain how the data was collected, loaded, transformed, tidied, and analyzed for outliers, etc. in our Meetup.
    • This is the **only** way I’ll have to determine that everyone actively participated in the process, so you need to hold yourself responsible for understanding what your class-size team did!
  • If you are unable to attend the meetup, then you need to either present to me one-on-one before the meetup presentation, or post a 3-to-5-minute video (e.g., on YouTube) explaining the process.
    • Individual students will not be responsible for explaining any forays into statistical analysis, modeling, data mining, regression, decision trees, etc.
    • You are encouraged to start early, ask many questions, actively post on the provided discussion forum, etc.

Database Creation, Import & Export

Connect to Local Database

A database is created on each team member’s local machine using MySQL. Using an Open Database Connectivity (ODBC) driver, a connection is established through a Data Source Name (DSN) that links R to a database user with SELECT and INSERT privileges. All files are available in the shared GitHub repository DATA607_CUNY_2021_Project3.
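For readers without a preconfigured DSN, a minimal sketch of an equivalent explicit connection is shown below; the driver name, server, and credentials are assumptions and should be replaced with your own.

library(DBI)
library(odbc)

# hypothetical explicit connection; the report itself uses the "data607" DSN
db_conn_alt <- dbConnect(
   odbc::odbc(),
   Driver   = "MySQL ODBC 8.0 Unicode Driver", # assumed driver name
   Server   = "localhost",
   Database = "jobs",                          # schema referenced by the SQL later
   UID      = "data607_user",                  # hypothetical user with SELECT/INSERT
   PWD      = Sys.getenv("MYSQL_PWD"),         # avoid hard-coding passwords
   Port     = 3306
)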

db_conn <- dbConnect(odbc(), "data607")
dbListTables(db_conn)
## [1] "job_opening_tbl" "location_dim"
# table location_dim fields are listed below
dbListFields(db_conn, "location_dim")
## [1] "uniq_id"  "location" "city"     "country"  "state"    "zip_code"
dbListFields(db_conn, "job_opening_tbl")
## [1] "uniq_id"   "job_title" "job_descr"

Load CSV

The function OD_DL_csv is created to access the necessary CSV files, which contain over 30,000 observations. This method was selected over loading directly from GitHub due to storage limitations with files this large. The shared-file URLs and CSV file names are stored in the variables url_location, url_job_openings, csv_locations, and csv_job_openings, respectively (plus equivalents for the two additional datasets).

OD_DL_csv takes the shared URL, splits it at the query string, unlists the pieces, and pastes a download parameter back on to build a direct-download link. From there, curl_download() from the curl package completes the download.

OD_DL_csv <- function(sharedURL, file_name, save2wd = FALSE){

   # Strip the query string from the shared URL
   URL1 <- unlist(strsplit(sharedURL, "[?]"))[1]
   URL1 <- paste0(URL1, "?download=1") # edit URL to make it a downloadable link

   # Download the file to a temp directory using the supplied file name
   curl::curl_download(
      URL1,
      destfile = file.path(tempdir(), file_name),
      mode = "wb"
      )

   # If the user wants it saved to their working directory, copy the file
   # (file.path() keeps the paths portable across operating systems)
   if(isTRUE(save2wd)){
      file.copy(
         from = file.path(tempdir(), file_name),
         to = "./")
      }

   # Return the CSV as a data.frame
   return(read.csv(file.path(tempdir(), file_name), stringsAsFactors = FALSE))

}

The variables used to retrieve our data are stored below.

url_Cleaned_DS <- "https://cuny-my.sharepoint.com/:x:/g/personal/gabriel_campos77_qmail_cuny_edu/EdcuUpO6t1RIqyHw-OP90UQB2qRYY3fcW5RPyA_xT348Qw?e=4HeXbe"

url_DS_job_market<- "https://cuny-my.sharepoint.com/:x:/g/personal/gabriel_campos77_qmail_cuny_edu/Eai1ECgMhthLmrut-eIdD5oBw4715IvolGgyG54owY0nWg?e=D0iRKm"

url_location<-
   "https://cuny-my.sharepoint.com/:x:/g/personal/gabriel_campos77_qmail_cuny_edu/EYAX_5NCc_hFtSUFY0W4NPMBEXiRYtgro4V3vrSJbpTcag?e=aRKrpf"

url_job_openings<-
   "https://cuny-my.sharepoint.com/:x:/g/personal/gabriel_campos77_qmail_cuny_edu/EWni58F8o9BKqLwvF-FDdIEBX7nGv3l7PeueVkrUEPhJ_A?e=9hOyGN"

csv_locations<-
   "job_locations.csv"

csv_job_openings<-
   "job_openings.csv"

csv_DS_job_market <-
  "alldata.csv"

csv_Cleaned_DS<- 
  "Cleaned_DS_Jobs.csv"

Calling the OD_DL_csv() function, the CSV files are loaded into location_dim_df, job_opening_df, and DS_job_df.

location_dim_df <- OD_DL_csv(sharedURL = url_location,
          file_name = csv_locations
          )
job_opening_df <- OD_DL_csv(sharedURL = url_job_openings,
          file_name = csv_job_openings
          )
# additional untidy data set
DS_job_df <- OD_DL_csv(sharedURL = url_DS_job_market,
          file_name = csv_DS_job_market
          )

Manipulating Imported CSV Dataset

filter() is used to keep only the rows whose job title contains the word Data.

# filter the ~30,000 rows in the Rmd to lighten the load on MySQL
job_opening_df <- job_opening_df %>%
  filter(grepl('Data', job_title))
# keep only the locations that match the filtered job openings
location_dim_df <- location_dim_df %>%
  filter(uniq_id %in% job_opening_df$uniq_id)

gsub() with a regular expression strips HTML tags from the job descriptions, reducing them to plain text.

job_opening_df$job_descr <- gsub("<[^>]+>","",job_opening_df$job_descr)
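As a quick illustration of the tag-stripping regex on a made-up snippet:

html_snippet <- "<p>Experience with <b>Python</b> and SQL</p>" # hypothetical input
gsub("<[^>]+>", "", html_snippet)
## [1] "Experience with Python and SQL"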

By taking the existing row names and adding 40,000 to them, the uniq_id column is created outside the ID range of our previous data frames, ensuring there are no collisions.

# add unique ID column based on row name of each entry
DS_job_df<-cbind( data.frame("uniq_id" = as.integer(rownames(DS_job_df))), DS_job_df)

# add 40K to each uniq_id to ensure none match the first 30K entries of the other sets
DS_job_df <- DS_job_df %>%
  mutate(uniq_id = uniq_id + 40000)
# split the location column into city, state, and zip_code
temp <- as.data.frame(str_split_fixed(DS_job_df$location, ",", 2))
temp$V2 <- str_trim(temp$V2)
temp$V2 <- str_replace_all(temp$V2, "\\s", " == ") # mark the state/zip boundary
DS_job_df <- (temp <- cbind(DS_job_df, "city" = temp$V1,
              as.data.frame(str_split_fixed(temp$V2, "==", 2))) %>%
                dplyr::rename("state" = V1, "zip_code" = V2))
# load the country column with a constant "United States" value
DS_job_df$country <- "United States"
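To see what the splitting logic does, here is a trace on a single hypothetical location string:

loc <- "San Francisco, CA 94105"                       # made-up example
parts <- str_split_fixed(loc, ",", 2)                  # "San Francisco" | " CA 94105"
state_zip <- str_trim(parts[, 2])                      # "CA 94105"
state_zip <- str_replace_all(state_zip, "\\s", " == ") # "CA == 94105"
str_split_fixed(state_zip, "==", 2)                    # "CA " | " 94105"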

# quick rearrange
DS_job_df <- DS_job_df %>%
  select(uniq_id, position, description, location, city, state, country, zip_code,
         company, reviews)
# stack the new rows onto the existing frames (rbind matches columns by name)
job_opening_df <- rbind(DS_job_df %>%
                          select(uniq_id, position, description) %>%
                          dplyr::rename("job_title" = position,
                                        "job_descr" = description),
                        job_opening_df)
location_dim_df <- rbind(DS_job_df %>%
                           select(uniq_id, location, city, state, country, zip_code),
                         location_dim_df)

Write to Database

Using the dbWriteTable() function, we write these data frames into the corresponding tables in our local database.

The database consists of two tables in Second Normal Form (2NF), as shown in the ER diagram below.

ER Diagram
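For reference, here is a minimal DDL sketch matching the fields listed by dbListFields() earlier. The column types and key constraints are assumptions; the authoritative definitions live in the SQL script in the project repo.

# hypothetical table definitions; adjust types to match the repo's SQL script
dbExecute(db_conn, "
  CREATE TABLE IF NOT EXISTS job_opening_tbl (
    uniq_id   INT PRIMARY KEY,
    job_title VARCHAR(255),
    job_descr TEXT
  )")
dbExecute(db_conn, "
  CREATE TABLE IF NOT EXISTS location_dim (
    uniq_id  INT PRIMARY KEY,
    location VARCHAR(255),
    city     VARCHAR(100),
    country  VARCHAR(100),
    state    VARCHAR(50),
    zip_code VARCHAR(10)
  )")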

dbWriteTable(db_conn, name = "location_dim", value = location_dim_df, append = TRUE,row.names = FALSE)
dbWriteTable(db_conn, name = "job_opening_tbl", value = job_opening_df, append = TRUE, row.names = FALSE)

A SQL script in the Project 3 repo is used to retrieve the merged data, filtered to job titles containing the word Data.

db_sql <- read_file(filename)     # read_file() from readr; filename points to the SQL script in the repo
db_sql <- gsub("\n", " ", db_sql) # flatten newlines so the query is a single string
str(db_sql)
##  chr "SELECT j.uniq_id, j.job_title, j.job_descr, l.location FROM jobs.job_opening_tbl j, jobs.location_dim l where j"| __truncated__
result_set = dbGetQuery(db_conn, db_sql)
class(result_set)
## [1] "data.frame"

Analysis

N-Gram function

phrases <- result_set$job_descr
#view(phrases)

createNgram <- function(stringVector, ngramSize){

  # data.table holds the result; textcnt() from the tau package counts
  # word n-grams when method = "string"
  ngram <- data.table()

  ng <- textcnt(stringVector, method = "string", n = ngramSize, tolower = FALSE)

  if(ngramSize == 1){
    ngram <- data.table(w1 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  else {
    ngram <- data.table(w1w2 = names(ng), freq = unclass(ng), length = nchar(names(ng)))
  }
  return(ngram)
}
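As a quick sanity check of the helper on a made-up sentence (expected output noted in the comment):

# expect the bigrams "data science", "science and", "and data", and
# "data analysis", each with freq 1
createNgram("data science and data analysis", 2)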
text <- phrases
res <- createNgram(text,1)
names(res)
## [1] "w1"     "freq"   "length"
head(res %>% arrange(desc(freq)),20)%>%
  knitr::kable(caption = "Projection of Word Frequency ")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")
## Warning in kableExtra::kable_styling(., bootstrap_options = "striped"): Please
## specify format in kable. kableExtra can customize either HTML or LaTeX outputs.
## See https://haozhu233.github.io/kableExtra/ for details.
Projection of Word Frequency
w1 freq length
and 18118 3
to 9548 2
of 7340 2
the 7333 3
a 5924 1
in 5570 2
data 5557 4
with 4827 4
for 3371 3
or 3350 2
is 2302 2
our 1986 3
as 1882 2
experience 1746 10
on 1740 2
business 1646 8
will 1641 4
that 1449 4
team 1448 4
are 1426 3

Plotting zip codes

top_states <- location_dim_df$zip_code

# zip.plot() from the muRL package maps the zip codes in location_dim_df onto a US map
zip.plot(location_dim_df,
         zip.file = system.file("extdata", "zips.tab", package = "muRL"),
         map.type = "usa", cex = 1, col = "orange", pch = 10,
         jitter.factor = NULL)

Frequency Count

Hard and Tech Skills

Using the N-gram function, which can extract single-word, two-word, or longer combinations, we extracted the most common hard, technical, and soft skills known from prior experience and from the articles linked below. The top 20 hard and technical skills are listed below in order of frequency.
The Most In-Demand Tech Skills for Data Scientists
Data Scientist Resume Sample Template & Data-Driven Guide
We used the N-gram function once to extract single-word skills, and again to extract two-word skills such as “quantitative analysis.” Alongside the N-gram function, we leveraged dplyr functions to tidy our data, ggplot for the bar graphs, and reactable for the table view of our data.

# hard skills: 
#https://towardsdatascience.com/the-most-in-demand-tech-skills-for-data-scientists-d716d10c191d
# technical, hard and soft skills:
#https://zety.com/blog/data-scientist-resume-example
# tolower() sets all extracted words to lowercase for accurate counting
job_descr_phrases  <- result_set$job_descr
job_descr_text     <- job_descr_phrases
job_phrases_1      <- createNgram(job_descr_text,1)
job_phrases_1$w1   <- tolower(job_phrases_1$w1) # convert to lowercase for accurate counting
job_phrases_2      <- createNgram(job_descr_text,2)
job_phrases_2$w1w2 <- tolower(job_phrases_2$w1w2)
# hard and tech skills single word skills
techskill = c("python", "r", "sql", "spark", "hadoop", "java",
              "tableau", "aws", "sas", "hive", "tensorflow", "scala",
              "c++", "excel", "azure", "mathematics", "statistics",
              "programming", "debugging", "probability", "modeling",
              "matplotlib", "openrefine", "matlab", "bigml", "d3.js",
              "ggplot2", "jupyter", "nltk", "scikit-learn",
              "weka", "predictive", "ai", "c", "linux", "nosql",
              "basic", "bayesian", "mapreduce")

hardskill= c("data visualization", "quantitative analysis", "data analysis",
             "artificial intelligence", "predictive analysis", "predictive modeling",
             "bachelors degree", "masters degree")
job_phrases_1 <-
  job_phrases_1 %>%
  select(-3) %>%
  filter(w1 %in% techskill) %>%
  group_by(w1) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))

colnames(job_phrases_1)[1] <- "Hard and Tech Skills"
#hard and tech double word skills
job_phrases_2 <-
  job_phrases_2 %>%
  select(-3) %>%
  filter(w1w2 %in% hardskill) %>%
  group_by(w1w2) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))

colnames(job_phrases_2)[1] <- "Hard and Tech Skills"
#combine the single word and two word skills into one dataframe
totaltech <-
  rbind(job_phrases_1, job_phrases_2) %>%
  arrange(desc(freq))

head(totaltech)%>%
  knitr::kable(caption = "Tech Skills for Data Science ")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")
Tech Skills for Data Science
Hard and Tech Skills freq
python 531
sql 481
modeling 432
r 426
programming 284
spark 245
#bar graph top 20 hard and tech skills
totaltech %>% 
  top_n(20) %>% 
    ggplot(aes(fct_reorder(`Hard and Tech Skills`,`freq`), `freq`))+
      geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
      coord_flip() +
        xlab("") +
        ylab("Frequency")+
        ggtitle("Top 20 Hard and Tech Skills")
## Selecting by freq

reactable(totaltech, bordered = TRUE, striped = TRUE,
          highlight = TRUE, filterable = TRUE,  showPageSizeOptions = TRUE,
          showPagination = TRUE, pageSizeOptions = c(5, 10, 20),
          defaultPageSize = 10)
Soft Skills

Note: for soft skills we extracted words similar to one another, such as lead, leadership, and leader, and then manually grouped them together since they are all suggestive of the same skill (a more reproducible alternative is sketched after the merging code below). The soft skills were extracted and organized in a similar fashion to the hard and technical skills above.

#soft skills single word
job_phrases_3 <- createNgram(job_descr_text,1)
job_phrases_3$w1 <- tolower(job_phrases_3$w1) 
softskill = c("communication", "collaboration", "teamwork", "collaborate",
              "professional", "veteran", "lead", "leadership", "leader",
              "innovation", "innovate", "innovative", "collaborative",
              "passionate", "creative", "motivated", "integrity", "effectiveness",
              "pioneering", "perceptiveness", "perceptive")
job_phrases_4 <- createNgram(job_descr_text,2)
job_phrases_4$w1w2 <- tolower(job_phrases_4$w1w2)
softskill2 = c("critical thinking", "problem solving","interpersonal skills",
               "time management")
job_phrases_3 <-
  job_phrases_3 %>%
  select(-3) %>%
  filter(w1 %in% softskill) %>%
  group_by(w1) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))

colnames(job_phrases_3)[1] <- "Soft Skills"
#soft skills double word
job_phrases_4 <- job_phrases_4 %>%
  select(-3) %>%
  filter(w1w2 %in% softskill2) %>%
  group_by(w1w2) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))
colnames(job_phrases_4)[1] <- "Soft Skills"
#combine the single word and two word skills into one dataframe
totalsoft <-
  rbind(job_phrases_3, job_phrases_4) %>%
  arrange(desc(freq)) # arrange in descending order

# Manually merge rows whose terms express the same skill (e.g., lead, leader,
# leadership): drop the variant rows by index and overwrite one row's freq with
# the summed counts. These indices are specific to this result set and will
# break if the data changes.
totalsoft <- totalsoft[-c(2, 10), ]
totalsoft[7, 2] = (302 + 158 + 141)

totalsoft <- totalsoft[-c(8, 15), ]
totalsoft[4, 2] = (264 + 149 + 53)

totalsoft <- totalsoft[-c(12, 10), ]
totalsoft[5, 2] = (203 + 106 + 91)

totalsoft <- arrange(totalsoft, desc(freq))
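A more reproducible alternative to the index-based merging above is to recode variant terms to a canonical label and re-aggregate. This is a sketch only: the variant-to-canonical mapping below is an assumption about which terms were grouped, so adjust it to match the intended groupings.

# sketch: map variants to one canonical label, then re-sum (mapping is assumed)
totalsoft_alt <- rbind(job_phrases_3, job_phrases_4) %>%
  mutate(`Soft Skills` = dplyr::recode(`Soft Skills`,
                                       lead = "leadership",
                                       leader = "leadership",
                                       innovate = "innovative",
                                       innovation = "innovative",
                                       collaborative = "collaborate")) %>%
  group_by(`Soft Skills`) %>%
  summarise(freq = sum(freq)) %>%
  arrange(desc(freq))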

head(totalsoft) %>%
  knitr::kable(caption = "Soft Skills for Data Science Jobs") %>%
  kableExtra::kable_styling(bootstrap_options = "striped")
Soft Skills for Data Science Jobs
Soft Skills freq
leadership 601
collaborate 466
professional 400
communication 356
veteran 265
innovative 264
#bar graph top 10 soft skills
totalsoft %>% 
  top_n(10) %>%  #extract top 10
    ggplot(aes(fct_reorder(`Soft Skills`,`freq`), `freq`))+
      geom_bar(stat="identity", fill="#f68060", alpha=.6, width=.4) +
      coord_flip() +
      xlab("") +
      ylab("Frequency")+
      ggtitle("Top Ten Soft Skills")
## Selecting by freq

reactable(totalsoft, bordered = TRUE, striped = TRUE, highlight = TRUE,
          filterable = TRUE, showPageSizeOptions = TRUE, showPagination = TRUE,
          pageSizeOptions = c(5, 10), defaultPageSize = 5)

Percentages

Each percentage below is a skill’s total mention count divided by the number of job postings in the result set, times 100; a skill mentioned more than once in a single posting is therefore counted each time.

#calc the percentage of hard and tech skills

totaltech_freq_pct <- round((totaltech$freq/nrow(result_set))*100, 2)
(totaltech <- cbind(totaltech, totaltech_freq_pct))%>%
  knitr::kable(caption = "Percentage of Hard and Tech Skills ")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")
Percentage of Hard and Tech Skills
Hard and Tech Skills freq totaltech_freq_pct
python 531 79.49
sql 481 72.01
modeling 432 64.67
r 426 63.77
programming 284 42.51
spark 245 36.68
mathematics 231 34.58
ai 228 34.13
predictive 228 34.13
hadoop 222 33.23
data analysis 204 30.54
java 196 29.34
c 172 25.75
hive 163 24.40
data visualization 132 19.76
tableau 124 18.56
sas 122 18.26
scala 105 15.72
azure 103 15.42
basic 95 14.22
excel 91 13.62
linux 86 12.87
artificial intelligence 83 12.43
aws 78 11.68
predictive modeling 71 10.63
matlab 70 10.48
nosql 66 9.88
tensorflow 61 9.13
mapreduce 60 8.98
quantitative analysis 29 4.34
bayesian 26 3.89
probability 20 2.99
jupyter 14 2.10
masters degree 12 1.80
nltk 8 1.20
bachelors degree 8 1.20
debugging 7 1.05
predictive analysis 6 0.90
matplotlib 5 0.75
weka 5 0.75
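The percentages can be sanity-checked by hand. The result-set size is not printed in this section, but the table implies roughly 668 postings (531 / 0.7949 ≈ 668), so treat that figure as an inference.

# python: 531 mentions over ~668 postings (inferred) reproduces the table's 79.49
round(531 / 668 * 100, 2)
## [1] 79.49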

#bar graph of top 15 hard and tech skills as percentage

totaltech %>%
  top_n(15) %>%
    ggplot(aes(y=reorder(`Hard and Tech Skills`,freq),x=freq,fill=`Hard and Tech Skills`)) +
      geom_bar(stat = 'identity',position=position_dodge()) +
      geom_text(aes(label=totaltech_freq_pct), vjust=1.0, color="black",
                position = position_dodge(0.9), size=3.0) +
      labs(y = ("Hard and Tech Skills"),x = ("Hard and Tech Skill Count"),
           title = ("Percentage of Hard/Technical Skills Found in Data Science Job Opps")) +
      theme_minimal()
## Selecting by totaltech_freq_pct

totalsoft_freq_pct <- round((totalsoft$freq/nrow(result_set))*100, 2)

(totalsoft <- cbind(totalsoft, totalsoft_freq_pct))%>%
  knitr::kable(caption = "Percentage of Soft Skills ")%>%
  kableExtra::kable_styling(bootstrap_options = "striped")
Percentage of Soft Skills
Soft Skills freq totalsoft_freq_pct
leadership 601 89.97
collaborate 466 69.76
professional 400 59.88
communication 356 53.29
veteran 265 39.67
innovative 264 39.52
passionate 173 25.90
problem solving 135 20.21
creative 124 18.56
collaboration 91 13.62
integrity 79 11.83
pioneering 52 7.78
effectiveness 37 5.54
interpersonal skills 34 5.09
teamwork 25 3.74
critical thinking 20 2.99
time management 16 2.40

#bar graph of top 10 soft skills as percentage

totalsoft %>%
  top_n(10) %>%
    ggplot(aes(y=reorder(`Soft Skills`,freq),x=freq,fill=`Soft Skills`)) +
      geom_bar(stat = 'identity',position=position_dodge()) +
      geom_text(aes(label=totalsoft_freq_pct), vjust=1.0, color="black", 
            position = position_dodge(0.9), size=3.0) +
      labs(y = ("Soft Skills"),x = ("Soft Skill Frequency"),
           title = ("Percentage of Soft Skills Found in Data Science Job Opps")) +
      theme_minimal()
## Selecting by totalsoft_freq_pct

Word Cloud and TreeMap

The word cloud reflects that Python, SQL, and R are crucial skills for a data scientist to progress in this field. Among soft skills, leadership, communication, and collaboration appear prominent. The tree map likewise identifies leadership, collaboration, communication, and professionalism as important soft skills for excelling as a data scientist.
set.seed(1234) # for reproducibility 
wordcloud(words =totaltech$`Hard and Tech Skills` ,
          freq = totaltech$freq, min.freq = 1,max.words=200,
          random.order=FALSE,
          rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))

set.seed(1234) # for reproducibility 
wordcloud(words =totalsoft$`Soft Skills` ,
          freq = totalsoft$freq, min.freq = 1,max.words=200,
          random.order=FALSE,
          rot.per=0.35,
          colors=brewer.pal(7, "Accent"))

TreeMap
Tech Skills
treemap(totaltech,
        index = c("Hard and Tech Skills"),
        vSize = "freq",
        type = "index",
        palette = "Set1",
        title = "Desirable Data Scientist Tech Skills",
        fontsize.title = 14
        )

Soft Skills
treemap(totalsoft,
        index = c("Soft Skills"),
        vSize = "freq",
        type = "index",
        palette = "Set2",
        title = "Desirable Data Scientist Soft Skills",
        fontsize.title = 14
        )

Conclusion

To conclude our findings, the charts in our analyses show that some of the most valued skills for data scientists are leadership, communication, Python, SQL, and R. We also see from the job locations that the largest concentration of data science jobs in the United States is in the Northeast, particularly near New York.

The foundation of this project is a relational database built in MySQL, in tandem with our own R functions that allowed us to read and import data into R and save our work to GitHub seamlessly.
From there, we created an N-gram function to analyze the text of the job descriptions. Finally, using the N-gram function, we measured which skills appear most often. Plotting the data onto graphs let us understand and interpret our results from several angles, while also being aesthetically pleasing.

From our experience working together on this project, we can confidently say that leadership and communication were the two most important soft skills. Thanks to these, we made this project a success despite varying skill levels and backgrounds, because we all shared what was on our minds with a clear sense of what to do next. As far as hard skills are concerned, the most important were designing and building the database, importing from the database into R, and performing the analyses.

This project was a pleasure to complete, as we all learned from each other and expanded our skill sets simultaneously. We now understand firsthand that soft skills are just as important as hard skills: the soft skills allowed us to get the work done effectively, which is the most important goal for any project.