1 Goal

Please use data to answer the question, “Which are the most valued data science skills?”

2 Pre-Requistes : Available Libraries

  • googlesheets
  • readxl
  • DT
  • data.table
  • kableExtra
  • dplyr
  • tidyr
  • tidyverse
  • rlang
  • stringr
  • RMySQL
  • DBI
  • ggplot2
  • tm
  • wordcloud2
  • RColorBrewer

3 Gather data

3.1 Read Data Source into R from Google/Spread Sheets

3.2 Show data

3.2.1 Untidy Data- Job Listings

Source Count
LinkedIn 8400
Indeed 5200
SimplyHired 3800
Monster 3750
AngelList 600

3.2.2 Untidy Data Table - Language Skills

Language LinkedIn Indeed SimplyHired Monster LinkedIn % Indeed % SimplyHired % Monster % Avg % GlassDoor Self Reported % 2017 Difference
Python 6347 3818 2888 2544 74% 74% 75% 68% 73% 72% 1%
R 4553 3106 2393 2365 53% 60% 62% 63% 60% 64% -4%
SQL 3879 2628 2056 1841 45% 51% 54% 49% 50% 51% -1%
Spark 2169 1551 1167 1062 25% 30% 30% 28% 29% 27% 2%
Hadoop 2142 1578 1164 1200 25% 31% 30% 32% 30% 39% -9%
Java 1944 1377 1059 1002 23% 27% 28% 27% 26% 33% -7%

3.2.3 Untidy Data Table - Software Skills

Software LinkedIn Indeed SimplyHired Monster
machine learning 5701 3439 2561 2340
analysis 5168 3500 2668 3306
statistics 4893 2992 2308 2399
computer science 4517 2739 2093 1900
communication 3404 2344 1791 2053
mathematics 2605 1961 1497 1815

3.2.4 Untidy Data Table - Demographics

Due to 16K rows and more than 200+ columns, it is taking time and eventually times out. So commented out the code to view the data.

4 Data Wrangling

Transformation & Tidying data for Language Skill sheet, before loading into DB

Step1 : For Language skill sheet, we are subsetting the data on first column using
select function from dplyr package. Then using the na.omit function from base package
to omit any rows containing NA in the data frame. then last function we are using to subset
the data frame on rows using the slice function.

Step2 : Using the base function scale to make the axis data scaler for plotting the graph.

Step3 : Merging the original dataframe first column and the newly created scaled dataframe using select function from dplyr & cbing function from base r .

Step4 : Now for sorting , we first calculated the mean and added the new AVGMean column to the dataframe, using mutate and mean functions.

Step5 : Now order based on the AvgMean column , and select the top 15 values, and then using gather function , gather columns into rows for plotting the graph.

Scale function :- In simpler terms without changing the data , we change the scale of the data (axis values while plotting)

language_skills_df<- slice(na.omit(dplyr::select(df_Language_Skills,1:5)),1:37)
languagedf <- dplyr::select(language_skills_df,2:5)
languagemaxs <- apply(languagedf, 2, max)
languagemins <- apply(languagedf, 2, min)

languageScaled_df <- as.data.frame(scale(languagedf, center = languagemins, scale = languagemaxs - languagemins))

language_df <- dplyr::select(language_skills_df,1)
languageSkillSet <- cbind.data.frame(language_df,languageScaled_df)
languageSkillSetDF <- mutate(languageSkillSet,AvgMean = apply(languageScaled_df,1, mean))
orderedLanguageSkillSetDF <- languageSkillSetDF[order(-languageSkillSetDF$AvgMean),]
orderedLanguageSkillSetDF_excludedAvgMean <- head(dplyr::select(orderedLanguageSkillSetDF,1:5),15)
languageSkillSetDF <- head(languageSkillSetDF,15)

orderedLanguageSkillSetDF_excludedAvgMean.long <- gather(orderedLanguageSkillSetDF_excludedAvgMean,variable, value,-1)

Transformation & Tidying data for Software Skill sheet, before loading into DB

Step1 : Using na.omit function remove all rows with NA values.

Step2 : Using filter method , to subset observations i.e. based on certain logic extract rows.

Step3 : Then using slice to further subset the dataframe .

software_skills_df <- na.omit(df_Software_Skills)
software_skills_df <- filter(software_skills_df,Software!="Total" )
software_skills_df <- slice(software_skills_df , 1:15)

Transformation & Tidying data for Software Skill sheet, before loading into DB

Step1 : Using select function, subsetting the dataframe.

step2 : Wrote function to select the questions with only one answer, remove rows with no respondent answers, group by responses to the question, counting the respondents asnswering and finally calculation the percentage for the same. Finally arranginf in descending order.

mcr_df <- df_MCR %>% select(c("GenderSelect", "Country", "Age", "CurrentJobTitleSelect", "EmploymentStatus", "Tenure", "FormalEducation","FirstTrainingSelect","LearningCategorySelftTaught","LearningCategoryOnlineCourses","LearningCategoryWork","LearningCategoryUniversity","LearningCategoryKaggle","LearningCategoryOther"))
mcr_df$Age <- as.numeric(as.character(mcr_df$Age))

# Function for single choice questions : A function to analyze questions where you choose only one answer
chooseOne = function(question, filterData = mcr_df){
  filterData %>% 
    filter(!UQ(sym(question)) == "") %>% 
    group_by_(question) %>% 
    summarise(count = n()) %>% 
    mutate(percent = (count / sum(count)) * 100) %>%  
    arrange(desc(count)) 
}

5 Load data into Database

Step1 : Using dbconnect function to make a connection to Google Cloud MySQL database.

## [1] "Connected to mysql on Sun Mar 24 23:53:54 2019"

Step2 : Create and Load Tables in mySQL Google Cloud , below are the tables to which data is loaded.
- Job_Listings_Tbl
- Language_Skills_Tbl
- Software_Skills_Tbl

Step3 : Using various functions from RMySQL package e.g. (dbExistsTable,dbSendQuery,dbRemoveTable and dbWriteTable) to create and remove table and check if table already exist , and to query the database.

6 Visualizations

6.1 Job Listings

Looking at the graph, we can safely assume that in the month of Oct 2018 LinkedIn posted the highest number of job requirements for Data Science .

6.2 Language Skills

Looking at the graph above, wecan safely assume that Python, followed by R and then SQL are most preffered language for Data Science in all Job sites and similarly at the end of the specturm C is least preffered language for Data Science.

6.3 Software Skills

Looking at the above graphs, we can safelu assume that Analysis is the most preffered Software skill in all Job portal other than LinkedIn, whereas Machine Learning is the most preffered software skill in LinkedIn. On the other end of specturm Data Engineering is the least preffered language in all Job Portals.

6.4 Demographics on Data Science Users

First Data Science Learning
FirstTrainingSelect count percent
Online courses (coursera, udemy, edx, etc.) 5299 36.018216
University courses 4162 28.289831
Self-taught 3697 25.129146
Work 990 6.729201
Other 307 2.086732
Kaggle competitions 257 1.746873

Employment Status
EmploymentStatus count percent
Employed full-time 10897 65.1890404
Not employed, but looking for work 2110 12.6226370
Independent contractor, freelancer, or self-employed 1330 7.9564489
Not employed, and not looking for work 924 5.5276382
Employed part-time 917 5.4857621
I prefer not to say 420 2.5125628
Retired 118 0.7059105
Job Titles
CurrentJobTitleSelect count percent
Data Scientist 2433 20.5663567
Software Developer/Software Engineer 1759 14.8689772
Other 1233 10.4226543
Data Analyst 1213 10.2535926
Scientist/Researcher 978 8.2671175
Business Analyst 796 6.7286560
Researcher 619 5.2324598
Machine Learning Engineer 617 5.2155537
Engineer 552 4.6661031
Programmer 462 3.9053254
Computer Scientist 335 2.8317836
Statistician 289 2.4429417
DBA/Database Engineer 187 1.5807270
Predictive Modeler 181 1.5300085
Data Miner 118 0.9974641
Operations Research Practitioner 58 0.4902790
Experience
Tenure count percent
1 to 2 years 3424 25.302985
3 to 5 years 3355 24.793083
Less than a year 2380 17.587940
More than 10 years 2028 14.986698
6 to 10 years 1714 12.666272
I don’t write code to analyze data 631 4.663021
Formal Education
FormalEducation count percent
Master’s degree 6273 41.7782218
Bachelor’s degree 4811 32.0412920
Doctoral degree 2347 15.6310356
Some college/university study without earning a bachelor’s degree 786 5.2347652
Professional degree 451 3.0036630
I did not complete any formal education past high school 257 1.7116217
I prefer not to answer 90 0.5994006

Country Looking at the above graph, we can safely assume that US and India are home to the most no of data science users . Russia, UK, People’s Republic of China, Brazil, Germany, France, Canada, and Australia are the close behind.

Gender

Data Science users of different gender identities generally fall into the similar age distributions

Age

## # A tibble: 1 x 2
##   median    sd
##    <dbl> <dbl>
## 1     30  10.5

** Looking at the above graph , The median age is 30 years (plus or minus 10 years).**

From above Graph it is clear that Russian and Chinese Data Science users are slightly younger than Data Science users from other countries. Also, there’s a wider age-range of users in the US and UK.

7 Summary

Overall the language most preffered for Data Science skill is Python followed by R & SQL, whereas Analysis followed by Machine learning is software skill which is in demand.