1 Goal

Please use data to answer the question, “Which are the most valued data science skills?”

2 Pre-Requisites: Available Libraries

  • googlesheets
  • readxl
  • DT
  • data.table
  • kableExtra
  • dplyr
  • tidyr
  • tidyverse
  • rlang
  • stringr
  • RMySQL
  • DBI
  • ggplot2
  • tm
  • wordcloud2
  • RColorBrewer
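The report assumes all of these packages are installed and attached. A minimal sketch of the (hidden) setup chunk that would load them:

# Load every package used in this report (assumes each is already installed)
library(googlesheets)   # optional: read directly from Google Sheets
library(readxl)
library(DT)
library(data.table)
library(kableExtra)
library(dplyr)
library(tidyr)
library(tidyverse)
library(rlang)
library(stringr)
library(RMySQL)
library(DBI)
library(ggplot2)
library(tm)
library(wordcloud2)
library(RColorBrewer)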

3 Gather data

3.1 Read Data Source into R from Google/Spread Sheets

Load the survey data above into R data frames using spreadsheet functions.

# which google sheets do you have access to? may ask you to authenticate in a browser!
#gs_ls("https://docs.google.com/spreadsheets/d/1rTr2r5NlSy8QBEhqwpP0HL2aZfWZflrN8kw7KC3Vi5M/edit#gid=1248983012")
#gs_ls("https://docs.google.com/spreadsheets/d/1rTr2r5NlSy8QBEhqwpP0HL2aZfWZflrN8kw7KC3Vi5M/edit?usp=sharing")

#SpreadSheet <- gs_title("Data Science")

#Get Sheet names
#gs_ws_ls(SpreadSheet)

# convert to data.frame
#df_Job_Listings <- as.data.frame(gs_read(ss=SpreadSheet, ws = "Job Listings"))
#df_Language_Skills <- as.data.frame(gs_read(ss=SpreadSheet, ws = "Language Skills"))
#df_Software_Skills <- as.data.frame(gs_read(ss=SpreadSheet, ws = "Software Skills"))

myWorkingDir <- getwd()
mySourceFile <- paste0(myWorkingDir,"/Data Science.xlsx")
excel_sheets(path = mySourceFile)
## [1] "Job Listings"            "Language Skills"        
## [3] "Software Skills"         "General Skills"         
## [5] "Background Profile"      "multipleChoiceResponses"
#df_Jobs <- read_excel(path = mySourceFile, sheet = 1, range = "A1:B6")
#df_Languages <- read_excel(path = mySourceFile, sheet = "Language Skills", range = "A1:E38")
#df_Softwares <- read_excel(path = mySourceFile, sheet = "Software Skills", range = "A1:E16")
df_Job_Listings <- read_excel(path = mySourceFile, sheet = 1)
df_Language_Skills <- read_excel(path = mySourceFile, sheet = "Language Skills")
df_Software_Skills <- read_excel(path = mySourceFile, sheet = "Software Skills")
df_MCR <- read_excel(path = mySourceFile, sheet = "multipleChoiceResponses")

3.2 Show Untidy Data

3.2.1 Untidy Data Table - Job Listings

DT::datatable(df_Job_Listings, options = list(pageLength = 5))
kable(df_Job_Listings) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"), full_width = F, position = "left", font_size = 12) %>%
  row_spec(0, background = "gray")
Source Count
LinkedIn 8400
Indeed 5200
SimplyHired 3800
Monster 3750
AngelList 600

3.2.2 Untidy Data Table - Language Skills

DT::datatable(df_Language_Skills, options = list(pageLength = 5))
kable(tail(df_Language_Skills)) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"), full_width = F, position = "left", font_size = 12) %>%
  row_spec(0, background = "gray")
Language LinkedIn Indeed SimplyHired Monster LinkedIn % Indeed % SimplyHired % Monster % Avg % GlassDoor Self Reported % 2017 Difference
NA NA NA NA NA NA NA NA NA NA NA NA
Total 38882 27477 21204 19350 NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA NA
Search Criteria NA NA NA NA NA NA NA NA NA NA NA
“data scientist” alone 8610 5138 3829 3746 NA NA NA NA NA NA NA
“data scientist” “[keyword]” NA NA NA NA NA NA NA NA NA NA NA

3.2.3 Untidy Data Table - Software Skills

DT::datatable(df_Software_Skills, options = list(pageLength = 5))
Software LinkedIn Indeed SimplyHired Monster
NLP 643 466 362 576
natural language processing 791 621 429 575
NLP + natural language processing 222 177 131 569
NA NA NA NA NA
“data scientist” “[keyword]” NA NA NA NA
“data engineering’ searched NA NA NA NA

3.2.4 Untidy Data Table - Demographics

DT::datatable(df_MCR, options = list(pageLength = 5))

4 Data Wrangling

  • Keep only the relevant columns
  • Remove redundant rows and columns
  • Reshape/transpose data where relevant

Transformation & Tidying for the Language Skills Sheet

  • Step 1 : Select the first five columns with dplyr::select, drop any rows containing NA with na.omit, and keep only the language rows with slice.
  • Step 2 : Min-max scale each portal's counts with the base scale function so the portals are comparable on one axis.
  • Step 3 : Re-attach the Language column to the scaled values using dplyr::select and cbind.
  • Step 4 : Compute the row-wise mean of the scaled counts and add it as a new AvgMean column with mutate.
  • Step 5 : Order by AvgMean, keep the top 15 languages, and gather the portal columns into long format for plotting.
# Keep the Language column plus the four portal count columns, drop rows containing NA,
# and keep only the first 37 rows (the trailing Total / search-criteria rows are dropped)
language_skills_df <- slice(na.omit(dplyr::select(df_Language_Skills, 1:5)), 1:37)
languagedf <- dplyr::select(language_skills_df, 2:5)

# Min-max scale each portal's counts to the 0-1 range so the portals are comparable
languagemaxs <- apply(languagedf, 2, max)
languagemins <- apply(languagedf, 2, min)
languageScaled_df <- as.data.frame(scale(languagedf, center = languagemins, scale = languagemaxs - languagemins))
#head(languageScaled_df)

# Re-attach the Language column and add the row-wise mean of the scaled counts as AvgMean
language_df <- dplyr::select(language_skills_df, 1)
languageSkillSet <- cbind.data.frame(language_df, languageScaled_df)
languageSkillSetDF <- mutate(languageSkillSet, AvgMean = apply(languageScaled_df, 1, mean))

# Order by AvgMean, keep the top 15 languages, and gather the portal columns to long format
orderedLanguageSkillSetDF <- languageSkillSetDF[order(-languageSkillSetDF$AvgMean), ]
orderedLanguageSkillSetDF_excludedAvgMean <- head(dplyr::select(orderedLanguageSkillSetDF, 1:5), 15)
languageSkillSetDF <- head(languageSkillSetDF, 15)
#orderedLanguageSkillSetDF_excludedAvgMean
orderedLanguageSkillSetDF_excludedAvgMean.long <- gather(orderedLanguageSkillSetDF_excludedAvgMean, variable, value, -1)
#head(orderedLanguageSkillSetDF_excludedAvgMean.long)

Transformation & Tidying for the Software Skills Sheet

  • Step 1 : Remove all rows containing NA values with na.omit.
  • Step 2 : Use filter to drop the "Total" summary row.
  • Step 3 : Use slice to keep only the first 15 skills.
# Drop rows with NA values, remove the "Total" summary row, and keep the first 15 skills
software_skills_df <- na.omit(df_Software_Skills)
software_skills_df <- filter(software_skills_df, Software != "Total")
software_skills_df <- slice(software_skills_df, 1:15)

Transformation & Tidying for the Demographics Sheet

  • Select the required columns from the long and wide data set using "dplyr", "tidyverse", "rlang" and "stringr"
  • Convert the non-numeric Age field to numeric
  • Create a custom helper that cleans up and summarizes a survey question via the filter, group_by, summarise, mutate and arrange verbs (a sketch of this helper follows the list)
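The custom helper used throughout the Demographics visualizations (chooseOne()) and the cleaned survey frame it operates on (mcr_df, referenced again in section 6.4) are not shown in this report. A minimal sketch of what they might look like, assuming the survey sheet read into df_MCR above; the function name, arguments, and exact cleaning rules are assumptions:

# Assumed reconstruction: survey responses with the non-numeric Age field coerced to numeric
mcr_df <- df_MCR %>%
  mutate(Age = as.numeric(Age))

# chooseOne(): tally and rank the answers to a single-choice survey question using
# the filter, group_by, summarise, mutate and arrange verbs described above
chooseOne <- function(question, data = mcr_df) {
  data %>%
    filter(!is.na(!!rlang::sym(question)), !!rlang::sym(question) != "") %>%
    group_by(!!rlang::sym(question)) %>%
    summarise(count = n()) %>%
    mutate(percent = count / sum(count) * 100) %>%
    arrange(desc(count))
}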

5 Load data into Database

Load the tidy data from the R data frames into a MySQL relational database on Google Cloud for data analysis and visual representation.

5.1 Connect to MySQL on Google Cloud

conn <- dbConnect(dbDriver('MySQL'),
                  dbname="CUNY_DATA607",
                  host="35.231.71.159",
                  user="root",
                  password="data607", 
                  port=3306)
paste("Connected to mysql on", date())
## [1] "Connected to mysql on Mon Mar 25 21:52:21 2019"

5.2 Create and Load Tables in MySQL on Google Cloud

  • Load into Job_Listings_Tbl table directly from the data frame via overwrite (a sketch of this load follows the list)

  • Load into Language_Skills_Tbl table via drop if exists (dbSendQuery) and create table (dbSendQuery) followed by load (dbWriteTable)
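The Job_Listings_Tbl load itself is not shown in this report; a minimal sketch, assuming df_Job_Listings is written straight to the database with overwrite:

# Write the Job Listings data frame to MySQL, replacing any existing table
dbWriteTable(conn, name = "Job_Listings_Tbl", value = df_Job_Listings,
             overwrite = TRUE, row.names = FALSE)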

# Drop table if it already exists
dbSendQuery(conn, "DROP TABLE IF EXISTS Language_Skills_Tbl;")
## <MySQLResult:16646144,0,5>
# Create table
dbSendQuery(conn, "CREATE TABLE Language_Skills_Tbl 
(Language     VARCHAR(100) NOT NULL,
 LinkedIn     VARCHAR(100) NOT NULL,
 Indeed         VARCHAR(100) NOT NULL,
 SimplyHired  VARCHAR(100) NOT NULL,
 Monster      VARCHAR(100) NOT NULL,
 CONSTRAINT pk_Language_Skills PRIMARY KEY (Language)
 );")
## <MySQLResult:10,0,6>
# Load table
dbWriteTable(conn,"Language_Skills_Tbl",language_skills_df,overwrite=T)
## [1] TRUE
  • Load into Software_Skills_Tbl table via drop (dbRemoveTable) if exists (dbExistsTable) and load (dbWriteTable)
# Drop table if it already exists
if (dbExistsTable(conn, "Software_Skills_Tbl"))
    dbRemoveTable(conn, "Software_Skills_Tbl")
## [1] TRUE
# Write the data frame to the database
dbWriteTable(conn, name = "Software_Skills_Tbl", value = software_skills_df, row.names = FALSE)
## [1] TRUE

6 Visualizations

6.1 Job Listings

tbl(conn, "Job_Listings_Tbl") %>%
  collect() %>%
  ggplot(aes(x=Source, y=Count, fill=Source, color = Source)) +
    geom_bar(stat="identity", position=position_dodge(), colour="black", width = 0.5) +
    ggtitle("Job Listings for Data Scientist for the month of October 2018") +
    xlab("Job Portal") + ylab("Number of Job Listings") +
    geom_text(aes(label=paste(Count)), vjust=0.5, hjust=1.1,color="black") +
    theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

As the graph above shows, LinkedIn posted the highest number of Data Science job listings in October 2018.

6.2 Software Skills

# The scaled "orderedSkillSetDF" objects referenced here are not built anywhere in this
# report, so plot the tidied software_skills_df directly: add a row-wise average for
# ordering, gather the portal columns into long format, and draw a dodged bar chart
software_skills_long <- software_skills_df %>%
  mutate(AvgCount = (LinkedIn + Indeed + SimplyHired + Monster) / 4) %>%
  gather("variable", "value", LinkedIn:Monster)

ggplot(data = software_skills_long, aes(x = reorder(Software, AvgCount), y = value, fill = variable)) +
  geom_col(position = position_dodge()) +
  labs(title = "Data Science Skill graph", x = "Skill Set", y = "Job Listings") +
  theme(axis.text.x = element_text(angle = 75, hjust = .45, vjust = 0.5)) +
  coord_flip()

df_Softwares <- dbGetQuery(conn, "SELECT * FROM Software_Skills_Tbl")
df_Software_Skills <- gather(df_Softwares, "Portal", "Count",2:5)
qplot(x=Software, y=Count, data=df_Software_Skills, shape=Portal, size=I(5), colour=Portal, 
      xlab="Software Skills", ylab="Frequency", main = "Software Skills by Frequency") + 
  theme(text = element_text(), axis.text.x = element_text(angle=90, vjust=1)) 

As the graph above shows, the most popular software skills for Data Science are machine learning and data analysis.

6.3 Language Skills

tbl(conn, sql("SELECT Language,LinkedIn,Indeed,SimplyHired,Monster FROM Language_Skills_Tbl ORDER BY (LinkedIn+Indeed+SimplyHired+Monster) DESC LIMIT 15")) %>%
  collect() %>%
  tidyr::gather("Portal", "Count",2:5) %>%
  ggplot(aes(x=reorder(Language,Count), y=Count)) + 
  geom_bar(aes(fill = Portal), position = "dodge", stat = "identity") + 
  xlab("Language") +
  ylab("Job Count") + 
  theme(text = element_text(), axis.text.x = element_text(angle=90, vjust=1)) 

# Collect the word/frequency pairs locally before handing them to wordcloud2
tbl(conn, sql("SELECT Language AS word, LinkedIn+Indeed+SimplyHired+Monster AS freq FROM Language_Skills_Tbl")) %>%
  collect() %>%
  wordcloud2(size=0.9, color='random-dark', shape = 'circle')

As the graph above shows, the most popular language skills for Data Science are Python, followed by R and SQL. It's no surprise, since Python is great for machine learning and AI, whereas R and SQL are great tools for data cleanup and data analysis with visual representations.

6.4 Demographics on Data Science Users

First Data Science Learning

chooseOne("FirstTrainingSelect") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")
FirstTrainingSelect count percent
Online courses (coursera, udemy, edx, etc.) 5299 36.018216
University courses 4162 28.289831
Self-taught 3697 25.129146
Work 990 6.729201
Other 307 2.086732
Kaggle competitions 257 1.746873
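The training data frame plotted below is not constructed anywhere in this report. A hedged sketch of how it might be built, assuming the survey's learning-split columns share a "LearningCategory" prefix (the column prefix is an assumption):

# Assumed reconstruction: gather the LearningCategory percentage columns into one
# response/percent pair per answer for the faceted histogram below
training <- mcr_df %>%
  dplyr::select(starts_with("LearningCategory")) %>%
  gather("response", "percent") %>%
  mutate(percent = as.numeric(percent)) %>%
  filter(!is.na(percent))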
ggplot(training, aes(x = percent, fill = response)) + 
  geom_histogram(bins = 10) + 
  facet_wrap(~response) + 
  xlab("Percent of Learning") + 
  ylab("Count of Given Percentage") + 
  theme(legend.position="none")

Employment Status

chooseOne("EmploymentStatus") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")
EmploymentStatus count percent
Employed full-time 10897 65.1890404
Not employed, but looking for work 2110 12.6226370
Independent contractor, freelancer, or self-employed 1330 7.9564489
Not employed, and not looking for work 924 5.5276382
Employed part-time 917 5.4857621
I prefer not to say 420 2.5125628
Retired 118 0.7059105

Job Titles

chooseOne("CurrentJobTitleSelect") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")
CurrentJobTitleSelect count percent
Data Scientist 2433 20.5663567
Software Developer/Software Engineer 1759 14.8689772
Other 1233 10.4226543
Data Analyst 1213 10.2535926
Scientist/Researcher 978 8.2671175
Business Analyst 796 6.7286560
Researcher 619 5.2324598
Machine Learning Engineer 617 5.2155537
Engineer 552 4.6661031
Programmer 462 3.9053254
Computer Scientist 335 2.8317836
Statistician 289 2.4429417
DBA/Database Engineer 187 1.5807270
Predictive Modeler 181 1.5300085
Data Miner 118 0.9974641
Operations Research Practitioner 58 0.4902790

Experience

chooseOne("Tenure") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")
Tenure count percent
1 to 2 years 3424 25.302985
3 to 5 years 3355 24.793083
Less than a year 2380 17.587940
More than 10 years 2028 14.986698
6 to 10 years 1714 12.666272
I don’t write code to analyze data 631 4.663021

Formal Education

chooseOne("FormalEducation") %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed","responsive"),full_width   = F,position = "left",font_size = 12) %>%
  row_spec(0, background ="gray")
FormalEducation count percent
Master’s degree 6273 41.7782218
Bachelor’s degree 4811 32.0412920
Doctoral degree 2347 15.6310356
Some college/university study without earning a bachelor’s degree 786 5.2347652
Professional degree 451 3.0036630
I did not complete any formal education past high school 257 1.7116217
I prefer not to answer 90 0.5994006
df_formalEducation <- chooseOne("FormalEducation")
pie(x = df_formalEducation$percent, labels = df_formalEducation$FormalEducation, main = "Education", radius = 1)

Country

chooseOne("Country") %>%
  ggplot(aes(x = reorder(Country, count), y = count)) + 
    geom_bar(stat = "identity") + 
    theme(axis.text.x = element_text(angle = 90,
                                     vjust = 0.5,
                                     hjust = 1)) + 
    xlab("Country of Residence")

So the US and India are home to the most data science users (a combined 41.4% of users). Russia, the UK, the People's Republic of China, Brazil, Germany, France, Canada, and Australia are the closest behind.

Gender
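The ageGender data frame used here is not defined earlier; a minimal sketch, assuming a GenderSelect column in the survey and the numeric Age built in section 4:

# Assumed reconstruction: respondents with both a gender response and a usable Age
ageGender <- mcr_df %>%
  filter(GenderSelect != "", !is.na(Age)) %>%
  dplyr::select(GenderSelect, Age)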

ggplot(ageGender, aes(x = Age, fill = GenderSelect)) + 
  geom_density(alpha=.3) + 
  facet_wrap(~GenderSelect) + 
  theme(legend.position="none")

Data Science users of different gender identities generally fall into similar age distributions.

Age

ageHist <- mcr_df %>% filter(!is.na(Age)) %>% dplyr::select(Age)

ggplot(ageHist, aes(x = Age)) + 
  geom_histogram(binwidth = 2) + 
  xlab("Age (years)") + 
  ylab("Number of Respondents")

The vast majority of data scientists are young adults (early 20s to 30s). What is the median age?

ageHist %>% summarise(median = median(Age, na.rm = TRUE), sd = sd(Age, na.rm = TRUE))
## # A tibble: 1 x 2
##   median    sd
##    <dbl> <dbl>
## 1     30  10.5
The median age is 30 years (plus or minus about 10 years).

Does the Age vary amongst people that come from the 5 countries with the most data science users?
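top5Age is also not built elsewhere; a minimal sketch, assuming a Country column in the survey and the chooseOne() helper sketched in section 4:

# Assumed reconstruction: respondents from the five countries with the most users
top5Countries <- chooseOne("Country") %>% head(5) %>% pull(Country)
top5Age <- mcr_df %>%
  filter(Country %in% top5Countries, !is.na(Age)) %>%
  dplyr::select(Country, Age)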

ggplot(top5Age, aes(x = Age, fill = Country)) + 
  geom_density(alpha = 0.3) + 
  facet_wrap(~Country) + 
  ylab("Density of Users of a Given Age") + 
  theme(legend.position="none")

It looks like Russian and Chinese Data Science users are slightly younger than Data Science users from other countries. Also, there's a wider age range of users in the US and UK.

7 Conclusion

As shown above, we can summarize the following:

  • Most preferred Language Skill : Python, followed by R and SQL
  • Most preferred Software Skill : Machine Learning and Data Analysis
  • Most common first Data Science training : Online courses, followed by university courses and self-teaching
  • Most common Data Science role : Data Scientist (20%) followed by Software Developer (15%)
  • Most common formal education degree : Master’s (42%) followed by Bachelor’s Degree (32%)
  • Countries with most Data Science Jobs : United States, followed by India and Russia
  • Median age of people working in Data Science Jobs ~ 30 years