1 Goal

Please use data to answer the question, “Which are the most valued data science skills?”

2 Pre-Requistes : Available Libraries

googlesheets
readxl
DT
data.table
kableExtra
dplyr
tidyr
tidyverse
rlang
stringr
RMySQL
DBI
ggplot2
tm
wordcloud2
RColorBrewer

3 Gather data

3.1 Read Data Source into R from Google/Spread Sheets

3.2 Show data

3.2.1 Untidy Data- Job Listings

Source	Count
LinkedIn	8400
Indeed	5200
SimplyHired	3800
Monster	3750
AngelList	600

3.2.2 Untidy Data Table - Language Skills

Language	LinkedIn	Indeed	SimplyHired	Monster	LinkedIn %	Indeed %	SimplyHired %	Monster %	Avg %	GlassDoor Self Reported % 2017	Difference
Python	6347	3818	2888	2544	74%	74%	75%	68%	73%	72%	1%
R	4553	3106	2393	2365	53%	60%	62%	63%	60%	64%	-4%
SQL	3879	2628	2056	1841	45%	51%	54%	49%	50%	51%	-1%
Spark	2169	1551	1167	1062	25%	30%	30%	28%	29%	27%	2%
Hadoop	2142	1578	1164	1200	25%	31%	30%	32%	30%	39%	-9%
Java	1944	1377	1059	1002	23%	27%	28%	27%	26%	33%	-7%

3.2.3 Untidy Data Table - Software Skills

Software	LinkedIn	Indeed	SimplyHired	Monster
machine learning	5701	3439	2561	2340
analysis	5168	3500	2668	3306
statistics	4893	2992	2308	2399
computer science	4517	2739	2093	1900
communication	3404	2344	1791	2053
mathematics	2605	1961	1497	1815

3.2.4 Untidy Data Table - Demographics

Due to 16K rows and more than 200+ columns, it is taking time and eventually times out. So commented out the code to view the data.

4 Data Wrangling

Transformation & Tidying data for Language Skill sheet, before loading into DB

Step1 : For Language skill sheet, we are subsetting the data on first column using
select function from dplyr package. Then using the na.omit function from base package
to omit any rows containing NA in the data frame. then last function we are using to subset
the data frame on rows using the slice function.

Step2 : Using the base function scale to make the axis data scaler for plotting the graph.

Step3 : Merging the original dataframe first column and the newly created scaled dataframe using select function from dplyr & cbing function from base r .

Step4 : Now for sorting , we first calculated the mean and added the new AVGMean column to the dataframe, using mutate and mean functions.

Step5 : Now order based on the AvgMean column , and select the top 15 values, and then using gather function , gather columns into rows for plotting the graph.

Scale function :- In simpler terms without changing the data , we change the scale of the data (axis values while plotting)

language_skills_df<- slice(na.omit(dplyr::select(df_Language_Skills,1:5)),1:37)
languagedf <- dplyr::select(language_skills_df,2:5)
languagemaxs <- apply(languagedf, 2, max)
languagemins <- apply(languagedf, 2, min)

languageScaled_df <- as.data.frame(scale(languagedf, center = languagemins, scale = languagemaxs - languagemins))

language_df <- dplyr::select(language_skills_df,1)
languageSkillSet <- cbind.data.frame(language_df,languageScaled_df)
languageSkillSetDF <- mutate(languageSkillSet,AvgMean = apply(languageScaled_df,1, mean))
orderedLanguageSkillSetDF <- languageSkillSetDF[order(-languageSkillSetDF$AvgMean),]
orderedLanguageSkillSetDF_excludedAvgMean <- head(dplyr::select(orderedLanguageSkillSetDF,1:5),15)
languageSkillSetDF <- head(languageSkillSetDF,15)

orderedLanguageSkillSetDF_excludedAvgMean.long <- gather(orderedLanguageSkillSetDF_excludedAvgMean,variable, value,-1)

Transformation & Tidying data for Software Skill sheet, before loading into DB

Step1 : Using na.omit function remove all rows with NA values.

Step2 : Using filter method , to subset observations i.e. based on certain logic extract rows.

Step3 : Then using slice to further subset the dataframe .

software_skills_df <- na.omit(df_Software_Skills)
software_skills_df <- filter(software_skills_df,Software!="Total" )
software_skills_df <- slice(software_skills_df , 1:15)

Transformation & Tidying data for Software Skill sheet, before loading into DB

Step1 : Using select function, subsetting the dataframe.

step2 : Wrote function to select the questions with only one answer, remove rows with no respondent answers, group by responses to the question, counting the respondents asnswering and finally calculation the percentage for the same. Finally arranginf in descending order.

mcr_df <- df_MCR %>% select(c("GenderSelect", "Country", "Age", "CurrentJobTitleSelect", "EmploymentStatus", "Tenure", "FormalEducation","FirstTrainingSelect","LearningCategorySelftTaught","LearningCategoryOnlineCourses","LearningCategoryWork","LearningCategoryUniversity","LearningCategoryKaggle","LearningCategoryOther"))
mcr_df$Age <- as.numeric(as.character(mcr_df$Age))

# Function for single choice questions : A function to analyze questions where you choose only one answer
chooseOne = function(question, filterData = mcr_df){
  filterData %>% 
    filter(!UQ(sym(question)) == "") %>% 
    group_by_(question) %>% 
    summarise(count = n()) %>% 
    mutate(percent = (count / sum(count)) * 100) %>%  
    arrange(desc(count)) 
}

5 Load data into Database

Step1 : Using dbconnect function to make a connection to Google Cloud MySQL database.

## [1] "Connected to mysql on Sun Mar 24 23:53:54 2019"

Step2 : Create and Load Tables in mySQL Google Cloud , below are the tables to which data is loaded.
- Job_Listings_Tbl
- Language_Skills_Tbl
- Software_Skills_Tbl

Step3 : Using various functions from RMySQL package e.g. (dbExistsTable,dbSendQuery,dbRemoveTable and dbWriteTable) to create and remove table and check if table already exist , and to query the database.

6 Visualizations

6.1 Job Listings

Looking at the graph, we can safely assume that in the month of Oct 2018 LinkedIn posted the highest number of job requirements for Data Science .

6.2 Language Skills

Looking at the graph above, wecan safely assume that Python, followed by R and then SQL are most preffered language for Data Science in all Job sites and similarly at the end of the specturm C is least preffered language for Data Science.

6.3 Software Skills

Looking at the above graphs, we can safelu assume that Analysis is the most preffered Software skill in all Job portal other than LinkedIn, whereas Machine Learning is the most preffered software skill in LinkedIn. On the other end of specturm Data Engineering is the least preffered language in all Job Portals.

6.4 Demographics on Data Science Users

First Data Science Learning

FirstTrainingSelect	count	percent
Online courses (coursera, udemy, edx, etc.)	5299	36.018216
University courses	4162	28.289831
Self-taught	3697	25.129146
Work	990	6.729201
Other	307	2.086732
Kaggle competitions	257	1.746873

Employment Status

EmploymentStatus	count	percent
Employed full-time	10897	65.1890404
Not employed, but looking for work	2110	12.6226370
Independent contractor, freelancer, or self-employed	1330	7.9564489
Not employed, and not looking for work	924	5.5276382
Employed part-time	917	5.4857621
I prefer not to say	420	2.5125628
Retired	118	0.7059105

Job Titles

CurrentJobTitleSelect	count	percent
Data Scientist	2433	20.5663567
Software Developer/Software Engineer	1759	14.8689772
Other	1233	10.4226543
Data Analyst	1213	10.2535926
Scientist/Researcher	978	8.2671175
Business Analyst	796	6.7286560
Researcher	619	5.2324598
Machine Learning Engineer	617	5.2155537
Engineer	552	4.6661031
Programmer	462	3.9053254
Computer Scientist	335	2.8317836
Statistician	289	2.4429417
DBA/Database Engineer	187	1.5807270
Predictive Modeler	181	1.5300085
Data Miner	118	0.9974641
Operations Research Practitioner	58	0.4902790

Experience

Tenure	count	percent
1 to 2 years	3424	25.302985
3 to 5 years	3355	24.793083
Less than a year	2380	17.587940
More than 10 years	2028	14.986698
6 to 10 years	1714	12.666272
I don’t write code to analyze data	631	4.663021

Formal Education

FormalEducation	count	percent
Master’s degree	6273	41.7782218
Bachelor’s degree	4811	32.0412920
Doctoral degree	2347	15.6310356
Some college/university study without earning a bachelor’s degree	786	5.2347652
Professional degree	451	3.0036630
I did not complete any formal education past high school	257	1.7116217
I prefer not to answer	90	0.5994006

Country Looking at the above graph, we can safely assume that US and India are home to the most no of data science users . Russia, UK, People’s Republic of China, Brazil, Germany, France, Canada, and Australia are the close behind.

Gender

Data Science users of different gender identities generally fall into the similar age distributions

Age

## # A tibble: 1 x 2
##   median    sd
##    <dbl> <dbl>
## 1     30  10.5

** Looking at the above graph , The median age is 30 years (plus or minus 10 years).**

From above Graph it is clear that Russian and Chinese Data Science users are slightly younger than Data Science users from other countries. Also, there’s a wider age-range of users in the US and UK.

7 Summary

Overall the language most preffered for Data Science skill is Python followed by R & SQL, whereas Analysis followed by Machine learning is software skill which is in demand.

Week 8 - Project 3 - Data Science Skills

Team DeViSal - Act Of Inventing : ’De’babrata Kabiraj, ’Vi’shal Arora, ’Sa’mriti Malhotra

March 22, 2019