1 Goal

Please use data to answer the question, “Which are the most valued data science skills?”

2 Pre-Requisites: Available Libraries

  • googlesheets
  • readxl
  • DT
  • data.table
  • kableExtra
  • dplyr
  • tidyr
  • tidyverse
  • rlang
  • stringr
  • RMySQL
  • DBI
  • ggplot2
  • tm
  • wordcloud2
  • RColorBrewer
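The report assumes all of these packages are installed and attached. A minimal sketch of the (hidden) setup chunk that would load them:

# Load every package used in this report (assumes each is already installed)
library(googlesheets)   # optional: read directly from Google Sheets
library(readxl)
library(DT)
library(data.table)
library(kableExtra)
library(dplyr)
library(tidyr)
library(tidyverse)
library(rlang)
library(stringr)
library(RMySQL)
library(DBI)
library(ggplot2)
library(tm)
library(wordcloud2)
library(RColorBrewer)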

3 Gather data

3.1 Read Data Source into R from Google/Spread Sheets

Load the survey data above into R data frames using spreadsheet functions.

# which google sheets do you have access to? may ask you to authenticate in a browser!
#gs_ls("https://docs.google.com/spreadsheets/d/1rTr2r5NlSy8QBEhqwpP0HL2aZfWZflrN8kw7KC3Vi5M/edit#gid=1248983012")
#gs_ls("https://docs.google.com/spreadsheets/d/1rTr2r5NlSy8QBEhqwpP0HL2aZfWZflrN8kw7KC3Vi5M/edit?usp=sharing")

#SpreadSheet <- gs_title("Data Science")

#Get Sheet names
#gs_ws_ls(SpreadSheet)

# convert to data.frame
#df_Job_Listings <- as.data.frame(gs_read(ss=SpreadSheet, ws = "Job Listings"))
#df_Language_Skills <- as.data.frame(gs_read(ss=SpreadSheet, ws = "Language Skills"))
#df_Software_Skills <- as.data.frame(gs_read(ss=SpreadSheet, ws = "Software Skills"))

myWorkingDir <- getwd()
mySourceFile <- paste0(myWorkingDir,"/Data Science.xlsx")
excel_sheets(path = mySourceFile)
## [1] "Job Listings"            "Language Skills"        
## [3] "Software Skills"         "General Skills"         
## [5] "Background Profile"      "multipleChoiceResponses"
#df_Jobs <- read_excel(path = mySourceFile, sheet = 1, range = "A1:B6")
#df_Languages <- read_excel(path = mySourceFile, sheet = "Language Skills", range = "A1:E38")
#df_Softwares <- read_excel(path = mySourceFile, sheet = "Software Skills", range = "A1:E16")
df_Job_Listings <- read_excel(path = mySourceFile, sheet = 1)
df_Language_Skills <- read_excel(path = mySourceFile, sheet = "Language Skills")
df_Software_Skills <- read_excel(path = mySourceFile, sheet = "Software Skills")
df_MCR <- read_excel(path = mySourceFile, sheet = "multipleChoiceResponses")

3.2 Show Untidy Data

3.2.1 Untidy Data Table - Job Listings

DT::datatable(df_Job_Listings, options = list(pageLength = 5))
kable(df_Job_Listings) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"), full_width = F, position = "left", font_size = 12) %>%
  row_spec(0, background = "gray")
Source Count
LinkedIn 8400
Indeed 5200
SimplyHired 3800
Monster 3750
AngelList 600

3.2.2 Untidy Data Table - Language Skills

DT::datatable(df_Language_Skills, options = list(pageLength = 5))
kable(tail(df_Language_Skills)) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"), full_width = F, position = "left", font_size = 12) %>%
  row_spec(0, background = "gray")
Language LinkedIn Indeed SimplyHired Monster LinkedIn % Indeed % SimplyHired % Monster % Avg % GlassDoor Self Reported % 2017 Difference
NA NA NA NA NA NA NA NA NA NA NA NA
Total 38882 27477 21204 19350 NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA
Search Criteria NA NA NA NA NA NA NA NA NA NA NA
“data scientist” alone 8610 5138 3829 3746 NA NA NA NA NA NA NA
“data scientist” “[keyword]” NA NA NA NA NA NA NA NA NA NA NA

3.2.3 Untidy Data Table - Software Skills

DT::datatable(df_Software_Skills, options = list(pageLength = 5))
Software LinkedIn Indeed SimplyHired Monster
NLP 643 466 362 576
natural language processing 791 621 429 575
NLP + natural language processing 222 177 131 569
NA NA NA NA NA
“data scientist” “[keyword]” NA NA NA NA
“data engineering’ searched NA NA NA NA

3.2.4 Untidy Data Table - Demographics

DT::datatable(df_MCR, options = list(pageLength = 5))

4 Data Wrangling

  • Keep only the relevant columns
  • Remove redundant rows and columns
  • Reshape/transpose data where relevant

Transformation & Tidying for the Language Skills Sheet

  • Step 1 : Select the first five columns with dplyr::select, drop any rows containing NA with na.omit, and keep only the language rows with slice.
  • Step 2 : Min-max scale each portal's counts with the base scale function so the portals are comparable on one axis.
  • Step 3 : Re-attach the Language column to the scaled values using dplyr::select and cbind.
  • Step 4 : Compute the row-wise mean of the scaled counts and add it as a new AvgMean column with mutate.
  • Step 5 : Order by AvgMean, keep the top 15 languages, and gather the portal columns into long format for plotting.
# Keep the Language column plus the four portal count columns, drop rows containing NA,
# and keep only the first 37 rows (the trailing Total / search-criteria rows are dropped)
language_skills_df <- slice(na.omit(dplyr::select(df_Language_Skills, 1:5)), 1:37)
languagedf <- dplyr::select(language_skills_df, 2:5)

# Min-max scale each portal's counts to the 0-1 range so the portals are comparable
languagemaxs <- apply(languagedf, 2, max)
languagemins <- apply(languagedf, 2, min)
languageScaled_df <- as.data.frame(scale(languagedf, center = languagemins, scale = languagemaxs - languagemins))
#head(languageScaled_df)

# Re-attach the Language column and add the row-wise mean of the scaled counts as AvgMean
language_df <- dplyr::select(language_skills_df, 1)
languageSkillSet <- cbind.data.frame(language_df, languageScaled_df)
languageSkillSetDF <- mutate(languageSkillSet, AvgMean = apply(languageScaled_df, 1, mean))

# Order by AvgMean, keep the top 15 languages, and gather the portal columns to long format
orderedLanguageSkillSetDF <- languageSkillSetDF[order(-languageSkillSetDF$AvgMean), ]
orderedLanguageSkillSetDF_excludedAvgMean <- head(dplyr::select(orderedLanguageSkillSetDF, 1:5), 15)
languageSkillSetDF <- head(languageSkillSetDF, 15)
#orderedLanguageSkillSetDF_excludedAvgMean
orderedLanguageSkillSetDF_excludedAvgMean.long <- gather(orderedLanguageSkillSetDF_excludedAvgMean, variable, value, -1)
#head(orderedLanguageSkillSetDF_excludedAvgMean.long)

Transformation & Tidying for the Software Skills Sheet

  • Step 1 : Remove all rows containing NA values with na.omit.
  • Step 2 : Use filter to drop the "Total" summary row.
  • Step 3 : Use slice to keep only the first 15 skills.
# Drop rows with NA values, remove the "Total" summary row, and keep the first 15 skills
software_skills_df <- na.omit(df_Software_Skills)
software_skills_df <- filter(software_skills_df, Software != "Total")
software_skills_df <- slice(software_skills_df, 1:15)

Transformation & Tidying for the Demographics Sheet

  • Select the required columns from the long and wide data set using "dplyr", "tidyverse", "rlang" and "stringr"
  • Convert the non-numeric Age field to numeric
  • Create a custom helper that cleans up and summarizes a survey question via the filter, group_by, summarise, mutate and arrange verbs (a sketch of this helper follows the list)
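The custom helper used throughout the Demographics visualizations (chooseOne()) and the cleaned survey frame it operates on (mcr_df, referenced again in section 6.4) are not shown in this report. A minimal sketch of what they might look like, assuming the survey sheet read into df_MCR above; the function name, arguments, and exact cleaning rules are assumptions:

# Assumed reconstruction: survey responses with the non-numeric Age field coerced to numeric
mcr_df <- df_MCR %>%
  mutate(Age = as.numeric(Age))

# chooseOne(): tally and rank the answers to a single-choice survey question using
# the filter, group_by, summarise, mutate and arrange verbs described above
chooseOne <- function(question, data = mcr_df) {
  data %>%
    filter(!is.na(!!rlang::sym(question)), !!rlang::sym(question) != "") %>%
    group_by(!!rlang::sym(question)) %>%
    summarise(count = n()) %>%
    mutate(percent = count / sum(count) * 100) %>%
    arrange(desc(count))
}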

5 Load data into Database

Load the tidy data from the R data frames into a MySQL relational database on Google Cloud for data analysis and visual representation.

5.1 Connect to MySQL on Google Cloud

conn <- dbConnect(dbDriver('MySQL'),
                  dbname="CUNY_DATA607",
                  host="35.231.71.159",
                  user="root",
                  password="data607", 
                  port=3306)
paste("Connected to mysql on", date())
## [1] "Connected to mysql on Mon Mar 25 21:52:21 2019"

5.2 Create and Load Tables in MySQL on Google Cloud

  • Load into Job_Listings_Tbl table directly from the data frame via overwrite (a sketch of this load follows the list)

  • Load into Language_Skills_Tbl table via drop if exists (dbSendQuery) and create table (dbSendQuery) followed by load (dbWriteTable)
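The Job_Listings_Tbl load itself is not shown in this report; a minimal sketch, assuming df_Job_Listings is written straight to the database with overwrite:

# Write the Job Listings data frame to MySQL, replacing any existing table
dbWriteTable(conn, name = "Job_Listings_Tbl", value = df_Job_Listings,
             overwrite = TRUE, row.names = FALSE)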

# Drop table if it already exists
dbSendQuery(conn, "DROP TABLE IF EXISTS Language_Skills_Tbl;")
## <MySQLResult:16646144,0,5>
# Create table
dbSendQuery(conn, "CREATE TABLE Language_Skills_Tbl 
(Language     VARCHAR(100) NOT NULL,
 LinkedIn     VARCHAR(100) NOT NULL,
 Indeed         VARCHAR(100) NOT NULL,
 SimplyHired  VARCHAR(100) NOT NULL,
 Monster      VARCHAR(100) NOT NULL,
 CONSTRAINT pk_Language_Skills PRIMARY KEY (Language)
 );")
## <MySQLResult:10,0,6>
# Load table
dbWriteTable(conn,"Language_Skills_Tbl",language_skills_df,overwrite=T)
## [1] TRUE
  • Load into Software_Skills_Tbl table via drop (dbRemoveTable) if exists (dbExistsTable) and load (dbWriteTable)
# Drop table if it already exists
if (dbExistsTable(conn, "Software_Skills_Tbl"))
    dbRemoveTable(conn, "Software_Skills_Tbl")
## [1] TRUE
# Write the data frame to the database
dbWriteTable(conn, name = "Software_Skills_Tbl", value = software_skills_df, row.names = FALSE)
## [1] TRUE

6 Visualizations

6.1 Job Listings

tbl(conn, "Job_Listings_Tbl") %>%
  collect() %>%
  ggplot(aes(x=Source, y=Count, fill=Source, color = Source)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.5) +
    ggtitle("Job Listings for Data Scientist for the month of October 2018") +
    xlab("Job Portal") + ylab("Number of Job Listings") +
    geom_text(aes(label=paste(Count)), vjust=0.5, hjust=1.1,color="black") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

As the graph above shows, LinkedIn posted the highest number of Data Science job listings in October 2018.

6.2 Software Skills

# The scaled "orderedSkillSetDF" objects referenced here are not built anywhere in this
# report, so plot the tidied software_skills_df directly: add a row-wise average for
# ordering, gather the portal columns into long format, and draw a dodged bar chart
software_skills_long <- software_skills_df %>%
  mutate(AvgCount = (LinkedIn + Indeed + SimplyHired + Monster) / 4) %>%
  gather("variable", "value", LinkedIn:Monster)

ggplot(data = software_skills_long, aes(x = reorder(Software, AvgCount), y = value, fill = variable)) +
  geom_col(position = position_dodge()) +
  labs(title = "Data Science Skill graph", x = "Skill Set", y = "Job Listings") +
  theme(axis.text.x = element_text(angle = 75, hjust = .45, vjust = 0.5)) +
  coord_flip()

df_Softwares <- dbGetQuery(conn, "SELECT * FROM Software_Skills_Tbl")
df_Software_Skills <- gather(df_Softwares, "Portal", "Count",2:5)
qplot(x=Software, y=Count, data=df_Software_Skills, shape=Portal, size=I(5), colour=Portal, 
      xlab="Software Skills", ylab="Frequency", main = "Software Skills by Frequency") + 
  theme(text = element_text(), axis.text.x = element_text(angle=90, vjust=1)) 

As the graph above shows, the most popular software skills for Data Science are machine learning and data analysis.

6.3 Language Skills

tbl(conn, sql("SELECT Language,LinkedIn,Indeed,SimplyHired,Monster FROM Language_Skills_Tbl ORDER BY (LinkedIn+Indeed+SimplyHired+Monster) DESC LIMIT 15")) %>%
  collect() %>%
  tidyr::gather("Portal", "Count",2:5) %>%
  ggplot(aes(x=reorder(Language,Count), y=Count)) + 
  geom_bar(aes(fill = Portal), position = "dodge", stat = "identity") + 
  xlab("Language") +
  ylab("Job Count") + 
  theme(text = element_text(), axis.text.x = element_text(angle=90, vjust=1)) 

# Collect the word/frequency pairs locally before handing them to wordcloud2
tbl(conn, sql("SELECT Language AS word, LinkedIn+Indeed+SimplyHired+Monster AS freq FROM Language_Skills_Tbl")) %>%
  collect() %>%
  wordcloud2(size=0.9, color='random-dark', shape = 'circle')

As the graph above shows, the most popular language skills for Data Science are Python, followed by R and SQL. It's no surprise, since Python is great for machine learning and AI, whereas R and SQL are great tools for data cleanup and data analysis with visual representations.

6.4 Demographics on Data Science Users

First Data Science Learning

chooseOne("FirstTrainingSelect") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")
FirstTrainingSelect count percent
Online courses (coursera, udemy, edx, etc.) 5299 36.018216
University courses 4162 28.289831
Self-taught 3697 25.129146
Work 990 6.729201
Other 307 2.086732
Kaggle competitions 257 1.746873
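The training data frame plotted below is not constructed anywhere in this report. A hedged sketch of how it might be built, assuming the survey's learning-split columns share a "LearningCategory" prefix (the column prefix is an assumption):

# Assumed reconstruction: gather the LearningCategory percentage columns into one
# response/percent pair per answer for the faceted histogram below
training <- mcr_df %>%
  dplyr::select(starts_with("LearningCategory")) %>%
  gather("response", "percent") %>%
  mutate(percent = as.numeric(percent)) %>%
  filter(!is.na(percent))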
ggplot(training, aes(x = percent, fill = response)) + 
  geom_histogram(bins = 10) + 
  facet_wrap(~response) + 
  xlab("Percent of Learning") + 
  ylab("Count of Given Percentage") + 
  theme(legend.position="none")

Employment Status

chooseOne("EmploymentStatus") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")
EmploymentStatus count percent
Employed full-time 10897 65.1890404
Not employed, but looking for work 2110 12.6226370
Independent contractor, freelancer, or self-employed 1330 7.9564489
Not employed, and not looking for work 924 5.5276382
Employed part-time 917 5.4857621
I prefer not to say 420 2.5125628
Retired 118 0.7059105

Job Titles

chooseOne("CurrentJobTitleSelect") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")
CurrentJobTitleSelect count percent
Data Scientist 2433 20.5663567
Software Developer/Software Engineer 1759 14.8689772
Other 1233 10.4226543
Data Analyst 1213 10.2535926
Scientist/Researcher 978 8.2671175
Business Analyst 796 6.7286560
Researcher 619 5.2324598
Machine Learning Engineer 617 5.2155537
Engineer 552 4.6661031
Programmer 462 3.9053254
Computer Scientist 335 2.8317836
Statistician 289 2.4429417
DBA/Database Engineer 187 1.5807270
Predictive Modeler 181 1.5300085
Data Miner 118 0.9974641
Operations Research Practitioner 58 0.4902790

Experience

chooseOne("Tenure") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")
Tenure count percent
1 to 2 years 3424 25.302985
3 to 5 years 3355 24.793083
Less than a year 2380 17.587940
More than 10 years 2028 14.986698
6 to 10 years 1714 12.666272
I don’t write code to analyze data 631 4.663021

Formal Education

chooseOne("FormalEducation") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")
FormalEducation count percent
Master’s degree 6273 41.7782218
Bachelor’s degree 4811 32.0412920
Doctoral degree 2347 15.6310356
Some college/university study without earning a bachelor’s degree 786 5.2347652
Professional degree 451 3.0036630
I did not complete any formal education past high school 257 1.7116217
I prefer not to answer 90 0.5994006
df_formalEducation <- chooseOne("FormalEducation")
pie(x = df_formalEducation$percent, labels = df_formalEducation$FormalEducation, main = "Education", radius = 1)

Country

chooseOne("Country") %>%
  ggplot(aes(x = reorder(Country, count), y = count)) + 
    geom_bar(stat = "identity") + 
    theme(axis.text.x = element_text(angle = 90,
                                     vjust = 0.5,
                                     hjust = 1)) + 
    xlab("Country of Residence")

So the US and India are home to the most data science users (a combined 41.4% of users). Russia, the UK, the People's Republic of China, Brazil, Germany, France, Canada, and Australia are the closest behind.

Gender
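The ageGender data frame used here is not defined earlier; a minimal sketch, assuming a GenderSelect column in the survey and the numeric Age built in section 4:

# Assumed reconstruction: respondents with both a gender response and a usable Age
ageGender <- mcr_df %>%
  filter(GenderSelect != "", !is.na(Age)) %>%
  dplyr::select(GenderSelect, Age)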

ggplot(ageGender, aes(x = Age, fill = GenderSelect)) + 
  geom_density(alpha=.3) + 
  facet_wrap(~GenderSelect) + 
  theme(legend.position="none")

Data Science users of different gender identities generally fall into similar age distributions.

Age

ageHist <- mcr_df %>% filter(!is.na(Age)) %>% dplyr::select(Age)

ggplot(ageHist, aes(x = Age)) + 
  geom_histogram(binwidth = 2) + 
  xlab("Age (years)") + 
  ylab("Number of Respondents")

The vast majority of data scientists are young adults (early 20s to 30s). What is the median age?

ageHist %>% summarise(median = median(Age, na.rm = TRUE), sd = sd(Age, na.rm = TRUE))
## # A tibble: 1 x 2
##   median    sd
##    <dbl> <dbl>
## 1     30  10.5
The median age is 30 years (plus or minus about 10 years).

Does the Age vary amongst people that come from the 5 countries with the most data science users?
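top5Age is also not built elsewhere; a minimal sketch, assuming a Country column in the survey and the chooseOne() helper sketched in section 4:

# Assumed reconstruction: respondents from the five countries with the most users
top5Countries <- chooseOne("Country") %>% head(5) %>% pull(Country)
top5Age <- mcr_df %>%
  filter(Country %in% top5Countries, !is.na(Age)) %>%
  dplyr::select(Country, Age)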

ggplot(top5Age, aes(x = Age, fill = Country)) + 
  geom_density(alpha = 0.3) + 
  facet_wrap(~Country) + 
  ylab("Density of Users of a Given Age") + 
  theme(legend.position="none")

It looks like Russian and Chinese Data Science users are slightly younger than Data Science users from other countries. Also, there's a wider age range of users in the US and UK.

7 Conclusion

As shown above, we can summarize the following:

  • Most preferred Language Skill : Python, followed by R and SQL
  • Most preferred Software Skill : Machine Learning and Data Analysis
  • Most common first Data Science training : Online courses, followed by university courses and self-teaching
  • Most common Data Science role : Data Scientist (20%) followed by Software Developer (15%)
  • Most common formal education degree : Master’s (42%) followed by Bachelor’s Degree (32%)
  • Countries with most Data Science Jobs : United States, followed by India and Russia
  • Median age of people working in Data Science Jobs ~ 30 years