Please use data to answer the question, “Which are the most valued data science skills?”
Source | Count |
---|---|
8400 | |
Indeed | 5200 |
SimplyHired | 3800 |
Monster | 3750 |
AngelList | 600 |
Language | Indeed | SimplyHired | Monster | LinkedIn % | Indeed % | SimplyHired % | Monster % | Avg % | GlassDoor Self Reported % 2017 | Difference | |
---|---|---|---|---|---|---|---|---|---|---|---|
Python | 6347 | 3818 | 2888 | 2544 | 74% | 74% | 75% | 68% | 73% | 72% | 1% |
R | 4553 | 3106 | 2393 | 2365 | 53% | 60% | 62% | 63% | 60% | 64% | -4% |
SQL | 3879 | 2628 | 2056 | 1841 | 45% | 51% | 54% | 49% | 50% | 51% | -1% |
Spark | 2169 | 1551 | 1167 | 1062 | 25% | 30% | 30% | 28% | 29% | 27% | 2% |
Hadoop | 2142 | 1578 | 1164 | 1200 | 25% | 31% | 30% | 32% | 30% | 39% | -9% |
Java | 1944 | 1377 | 1059 | 1002 | 23% | 27% | 28% | 27% | 26% | 33% | -7% |
Software | Indeed | SimplyHired | Monster | |
---|---|---|---|---|
machine learning | 5701 | 3439 | 2561 | 2340 |
analysis | 5168 | 3500 | 2668 | 3306 |
statistics | 4893 | 2992 | 2308 | 2399 |
computer science | 4517 | 2739 | 2093 | 1900 |
communication | 3404 | 2344 | 1791 | 2053 |
mathematics | 2605 | 1961 | 1497 | 1815 |
Due to 16K rows and more than 200+ columns, it is taking time and eventually times out. So commented out the code to view the data.
Transformation & Tidying data for Language Skill sheet, before loading into DB
Step1 : For Language skill sheet, we are subsetting the data on first column using
select function from dplyr package. Then using the na.omit function from base package
to omit any rows containing NA in the data frame. then last function we are using to subset
the data frame on rows using the slice function.
Step2 : Using the base function scale to make the axis data scaler for plotting the graph.
Step3 : Merging the original dataframe first column and the newly created scaled dataframe using select function from dplyr & cbing function from base r .
Step4 : Now for sorting , we first calculated the mean and added the new AVGMean column to the dataframe, using mutate and mean functions.
Step5 : Now order based on the AvgMean column , and select the top 15 values, and then using gather function , gather columns into rows for plotting the graph.
Scale function :- In simpler terms without changing the data , we change the scale of the data (axis values while plotting)
language_skills_df<- slice(na.omit(dplyr::select(df_Language_Skills,1:5)),1:37)
languagedf <- dplyr::select(language_skills_df,2:5)
languagemaxs <- apply(languagedf, 2, max)
languagemins <- apply(languagedf, 2, min)
languageScaled_df <- as.data.frame(scale(languagedf, center = languagemins, scale = languagemaxs - languagemins))
language_df <- dplyr::select(language_skills_df,1)
languageSkillSet <- cbind.data.frame(language_df,languageScaled_df)
languageSkillSetDF <- mutate(languageSkillSet,AvgMean = apply(languageScaled_df,1, mean))
orderedLanguageSkillSetDF <- languageSkillSetDF[order(-languageSkillSetDF$AvgMean),]
orderedLanguageSkillSetDF_excludedAvgMean <- head(dplyr::select(orderedLanguageSkillSetDF,1:5),15)
languageSkillSetDF <- head(languageSkillSetDF,15)
orderedLanguageSkillSetDF_excludedAvgMean.long <- gather(orderedLanguageSkillSetDF_excludedAvgMean,variable, value,-1)
Transformation & Tidying data for Software Skill sheet, before loading into DB
Step1 : Using na.omit function remove all rows with NA values.
Step2 : Using filter method , to subset observations i.e. based on certain logic extract rows.
Step3 : Then using slice to further subset the dataframe .
software_skills_df <- na.omit(df_Software_Skills)
software_skills_df <- filter(software_skills_df,Software!="Total" )
software_skills_df <- slice(software_skills_df , 1:15)
Transformation & Tidying data for Software Skill sheet, before loading into DB
Step1 : Using select function, subsetting the dataframe.
step2 : Wrote function to select the questions with only one answer, remove rows with no respondent answers, group by responses to the question, counting the respondents asnswering and finally calculation the percentage for the same. Finally arranginf in descending order.
mcr_df <- df_MCR %>% select(c("GenderSelect", "Country", "Age", "CurrentJobTitleSelect", "EmploymentStatus", "Tenure", "FormalEducation","FirstTrainingSelect","LearningCategorySelftTaught","LearningCategoryOnlineCourses","LearningCategoryWork","LearningCategoryUniversity","LearningCategoryKaggle","LearningCategoryOther"))
mcr_df$Age <- as.numeric(as.character(mcr_df$Age))
# Function for single choice questions : A function to analyze questions where you choose only one answer
chooseOne = function(question, filterData = mcr_df){
filterData %>%
filter(!UQ(sym(question)) == "") %>%
group_by_(question) %>%
summarise(count = n()) %>%
mutate(percent = (count / sum(count)) * 100) %>%
arrange(desc(count))
}
Step1 : Using dbconnect function to make a connection to Google Cloud MySQL database.
## [1] "Connected to mysql on Sun Mar 24 23:53:54 2019"
Step2 : Create and Load Tables in mySQL Google Cloud , below are the tables to which data is loaded.
- Job_Listings_Tbl
- Language_Skills_Tbl
- Software_Skills_Tbl
Step3 : Using various functions from RMySQL package e.g. (dbExistsTable,dbSendQuery,dbRemoveTable and dbWriteTable) to create and remove table and check if table already exist , and to query the database.
Looking at the graph, we can safely assume that in the month of Oct 2018 LinkedIn posted the highest number of job requirements for Data Science .
Looking at the graph above, wecan safely assume that Python, followed by R and then SQL are most preffered language for Data Science in all Job sites and similarly at the end of the specturm C is least preffered language for Data Science.
Looking at the above graphs, we can safelu assume that Analysis is the most preffered Software skill in all Job portal other than LinkedIn, whereas Machine Learning is the most preffered software skill in LinkedIn. On the other end of specturm Data Engineering is the least preffered language in all Job Portals.
FirstTrainingSelect | count | percent |
---|---|---|
Online courses (coursera, udemy, edx, etc.) | 5299 | 36.018216 |
University courses | 4162 | 28.289831 |
Self-taught | 3697 | 25.129146 |
Work | 990 | 6.729201 |
Other | 307 | 2.086732 |
Kaggle competitions | 257 | 1.746873 |
EmploymentStatus | count | percent |
---|---|---|
Employed full-time | 10897 | 65.1890404 |
Not employed, but looking for work | 2110 | 12.6226370 |
Independent contractor, freelancer, or self-employed | 1330 | 7.9564489 |
Not employed, and not looking for work | 924 | 5.5276382 |
Employed part-time | 917 | 5.4857621 |
I prefer not to say | 420 | 2.5125628 |
Retired | 118 | 0.7059105 |
CurrentJobTitleSelect | count | percent |
---|---|---|
Data Scientist | 2433 | 20.5663567 |
Software Developer/Software Engineer | 1759 | 14.8689772 |
Other | 1233 | 10.4226543 |
Data Analyst | 1213 | 10.2535926 |
Scientist/Researcher | 978 | 8.2671175 |
Business Analyst | 796 | 6.7286560 |
Researcher | 619 | 5.2324598 |
Machine Learning Engineer | 617 | 5.2155537 |
Engineer | 552 | 4.6661031 |
Programmer | 462 | 3.9053254 |
Computer Scientist | 335 | 2.8317836 |
Statistician | 289 | 2.4429417 |
DBA/Database Engineer | 187 | 1.5807270 |
Predictive Modeler | 181 | 1.5300085 |
Data Miner | 118 | 0.9974641 |
Operations Research Practitioner | 58 | 0.4902790 |
Tenure | count | percent |
---|---|---|
1 to 2 years | 3424 | 25.302985 |
3 to 5 years | 3355 | 24.793083 |
Less than a year | 2380 | 17.587940 |
More than 10 years | 2028 | 14.986698 |
6 to 10 years | 1714 | 12.666272 |
I don’t write code to analyze data | 631 | 4.663021 |
FormalEducation | count | percent |
---|---|---|
Master’s degree | 6273 | 41.7782218 |
Bachelor’s degree | 4811 | 32.0412920 |
Doctoral degree | 2347 | 15.6310356 |
Some college/university study without earning a bachelor’s degree | 786 | 5.2347652 |
Professional degree | 451 | 3.0036630 |
I did not complete any formal education past high school | 257 | 1.7116217 |
I prefer not to answer | 90 | 0.5994006 |
Country Looking at the above graph, we can safely assume that US and India are home to the most no of data science users . Russia, UK, People’s Republic of China, Brazil, Germany, France, Canada, and Australia are the close behind.
Gender
Data Science users of different gender identities generally fall into the similar age distributions
Age
## # A tibble: 1 x 2
## median sd
## <dbl> <dbl>
## 1 30 10.5
** Looking at the above graph , The median age is 30 years (plus or minus 10 years).**
From above Graph it is clear that Russian and Chinese Data Science users are slightly younger than Data Science users from other countries. Also, there’s a wider age-range of users in the US and UK.
Overall the language most preffered for Data Science skill is Python followed by R & SQL, whereas Analysis followed by Machine learning is software skill which is in demand.