As future data scientists and analysts, we believe that knowing the latest trends in the data science job market is key to employment. The goal of this project is to gain better insights into the most in-demand technical skills and positions, and into the regions and types of companies that are in need of data scientists. We used a raw dataset of 1,911 job descriptions extracted from CVs and turned it into a basetable with 1,911 observations and 27 variables (for details, please refer to the Methodology and Results tab). 98% of the job descriptions are in English (1,871 observations); on rare occasions, they are in French. For pre-processing and text mining, we used the following packages and techniques: tm (tm_map), stringr, openNLP, regular expressions, SnowballC, slam, RWeka, cld2, and Matrix.
OpenNLP
We started with a test run of the openNLP package. However, we quickly realized that its pre-built models would not lead to great results in our particular case. One limitation of these NER models became obvious: when documents such as our job descriptions contain many program names and software descriptions, it is almost impossible for the system to determine whether an entity is a company or just a skill. For example, if a person worked at Oracle in Redwood City, the system should detect Oracle as the company/employer. However, if the person worked with an Oracle database management system, it should detect Oracle not as the company/employer but as a skill. In addition to this ambiguity, the job descriptions mostly contained lists of programs, skills, and tasks rather than real sentences, which makes it even more difficult for classical NER systems to correctly identify and classify the words in unstructured text. With this in mind, we changed our strategy from the pre-built models to a dictionary- and rule-based approach. We created a dictionary with the classifications we wanted to extract, such as the job title, the location, and the skills. Moreover, we studied the raw text files to identify patterns and define rules for extracting the entities from the unstructured text. Because some extractions relied on special characters, we had to abandon some of the pre-processing steps we had seen in class.
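As an illustration of the kind of test we ran, here is a minimal sketch of the openNLP pipeline (assuming the openNLPmodels.en model package is installed; the sentence is a made-up example):

```r
library(NLP)
library(openNLP)  # the entity models additionally require openNLPmodels.en

# A sentence where "Oracle" names a product (a skill), not an employer
s <- as.String("Worked with an Oracle database management system in Redwood City.")

# Sentence and word tokenization, then the pre-built organization annotator
anns <- NLP::annotate(s, list(Maxent_Sent_Token_Annotator(),
                              Maxent_Word_Token_Annotator(),
                              Maxent_Entity_Annotator(kind = "organization")))

# The pre-built model has no way to tell product mentions from employers,
# so "Oracle" may well be tagged as an organization here
s[anns[anns$type == "entity"]]
```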
Job Titles
Luckily, most of the job descriptions start with the job title. Knowing this, we created a function that checks whether one of the general terms in our dictionary, such as "engineer" or "developer", is matched in the unstructured text. If there is a match, we split the whole text into two strings right after that term. Finally, we take as the job title the matched term together with everything to the left of it. This approach is fairly simple but catches most of the job titles. However, it has some drawbacks, as it is not able to extract multiple job titles from one document. Based on this approach, we found that 39%, 20%, and 4% of job positions are for developers, engineers, and analysts, respectively.
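A simplified sketch of this function, with an illustrative dictionary (the term list and function name are not our exact code):

```r
library(stringr)

# Illustrative dictionary of general job-title terms
title_terms <- c("engineer", "developer", "analyst", "scientist")

extract_job_title <- function(text) {
  lower <- str_to_lower(text)
  for (term in title_terms) {
    pos <- str_locate(lower, term)            # first occurrence of the term
    if (!is.na(pos[1, "end"])) {
      # the title is the matched term plus everything to its left
      return(str_trim(str_sub(text, 1, pos[1, "end"])))
    }
  }
  NA_character_                               # no dictionary term matched
}

extract_job_title("Senior Data Engineer - designed ETL pipelines in Python")
#> [1] "Senior Data Engineer"
```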
Organizations
For the organisations, the approach is similar to the one used for job titles. Knowing the general position of the location in the offers allowed us to split the unstructured text twice and take the remaining part as the company name. In some cases, however, the location is simply not provided. We therefore decided that whenever the remaining string is longer than seven words, it is highly unlikely to be a company name, and we set the value to "missing".
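The seven-word plausibility check can be sketched as follows (the splitting itself follows the position patterns described above; the function name and examples are illustrative):

```r
library(stringr)

# Given the candidate string left over after the two splits, discard it
# if it is too long to plausibly be a company name
clean_organisation <- function(candidate) {
  n_words <- str_count(str_trim(candidate), "\\S+")
  if (n_words == 0 || n_words > 7) return("missing")
  str_trim(candidate)
}

clean_organisation("Acme Analytics Inc.")
#> [1] "Acme Analytics Inc."
clean_organisation("responsible for building dashboards and reports for clients")
#> [1] "missing"
```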
Language
To detect the language of the job descriptions, we tested two packages: textcat and cld2. In our case, there was no notable performance difference between them, although the R community highlights the better performance of cld2. Looking at the results, textcat identifies a language for every document but sometimes detects implausible languages such as Romansh and Catalan. cld2, on the other hand, returns NA for ten of the 1,911 documents (mostly short descriptions with many special characters) but correctly identifies the language (English or French) for the rest. Considering the community's recommendation as well as our own results, we ended up using the cld2 package.
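A minimal sketch of the comparison on a few made-up descriptions:

```r
library(cld2)

descriptions <- c("Experienced data engineer with Python and SQL skills",
                  "Ingénieur data avec de l'expérience en Python et SQL",
                  "C++ // $$ @@")  # short, special-character-heavy text

detect_language(descriptions)
#> [1] "en" "fr" NA   (short noisy strings are the ones that tend to return NA)
```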
Location
In order to see which regions are in high need of data scientists and analysts, we first reviewed the job descriptions and found the following patterns:
1. Locations are all in the United States.
2. City and state are separated by a comma.
3. The city always comes before the state.
4. The state has a trailing space, meaning it is located at the end of a line without any characters following it.
Based on these findings, we first replaced punctuation with spaces and removed all whitespace at the end of the states. Once the text was separated by spaces, we matched the split tokens against a file we created containing the states and their abbreviations (e.g., "CA" for California). If a token matches a state abbreviation in this file, we take the token one position before it (since the city comes before the state) and assign it to the city column. For the states, we applied the same regex technique, but without adjusting the position, as we needed the state itself. To avoid extracting random words that happen to contain the characters of a state abbreviation, we matched the state abbreviations in capital letters only. According to our model, California, New York, and Texas are in high need of data/tech-savvy professionals, while West Virginia, Montana, and Hawaii are in low need.
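A condensed sketch of the city/state extraction (the abbreviation lookup below is truncated; our file covered all 50 states):

```r
library(stringr)

# Illustrative lookup of state abbreviations (truncated here)
states <- c(CA = "California", NY = "New York", TX = "Texas")

extract_city_state <- function(text) {
  # replace punctuation with spaces, trim trailing whitespace, then split
  tokens <- str_split(str_trim(str_replace_all(text, "[[:punct:]]", " ")),
                      "\\s+")[[1]]
  # match capitalised two-letter abbreviations only, to avoid random hits
  idx <- which(tokens %in% names(states))
  if (length(idx) == 0) return(c(city = NA, state = NA))
  i <- idx[length(idx)]                # the state sits at the end of the line
  c(city = tokens[i - 1], state = unname(states[tokens[i]]))
}

extract_city_state("Data Analyst at Acme Corp, Austin, TX")
#>     city    state
#> "Austin"  "Texas"
```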
Skills
To extract the technical skills from the job descriptions, we used a dictionary-based approach. We created a dictionary with skills related to data analytics and IT, then used the stringr package to match the skills in our dictionary against each job description with the str_detect function. All the necessary pre-processing steps were performed: removing punctuation, removing stop words, converting the text to lower case, and removing numbers. We found that the three most in-demand technical skills are Python, Java, and SQL.
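A minimal sketch of the dictionary matching (the skill list is illustrative and much shorter than our dictionary):

```r
library(stringr)

# Illustrative skills dictionary (ours was considerably larger)
skills <- c("python", "java", "sql", "spark", "tableau")

description <- str_to_lower("Requirements: Python, SQL, and experience with Spark")

# one logical flag per skill: does the pre-processed description mention it?
# word boundaries avoid matching, e.g., "java" inside "javascript"
found <- sapply(skills, function(s) str_detect(description, paste0("\\b", s, "\\b")))
skills[found]
#> [1] "python" "sql"    "spark"
```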
Evaluation
To evaluate our model, we created a comparison set by reading 20 random descriptions and manually writing down the results for each of the 23 categories. Comparing this set with the data frame produced by our program, we were able to calculate the accuracy. Although the accuracy of the model reaches 91 percent, we have to keep in mind that this score is heavily dependent on the skills section. If we do not consider the skills section, the accuracy drops to 76 percent, which is a more realistic measure of performance.
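The accuracy calculation itself is a simple cell-wise comparison; a sketch, assuming the manual labels and the model output are data frames with identical columns (the "skill_" column prefix is an assumed naming convention):

```r
# manual and predicted: one row per hand-checked description,
# one column per extracted category
accuracy <- function(manual, predicted) {
  mean(as.matrix(manual) == as.matrix(predicted), na.rm = TRUE)
}

overall <- accuracy(manual, predicted)    # ~0.91 in our case

# the same score excluding the skill columns (~0.76 in our case)
keep     <- !grepl("^skill_", names(manual))
no_skill <- accuracy(manual[keep], predicted[keep])
```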
Using different concept extraction techniques, we were able to better understand the current job market, and we would like to share the following key findings with others who are also actively looking for data science jobs:
1. States in high need of data/tech-savvy professionals: California, New York, and Texas.
2. Most desired positions: Developer (39%) and Engineer (20%).
3. Top 3 technical skills: Python, Java, and SQL.
We also faced several challenges and limitations:
1. We did not have enough information about the origin of the documents and, consequently, also lacked some other background information.
2. The topic was covered for only ten minutes in class, and, unfortunately, help on the internet is limited. Where articles are available, they mainly focus on applying the currently available pre-built NER systems to texts such as news articles or scientific papers, which consist of full sentences. Therefore, we needed to find a solution to the problem on our own.
3. Different categories required different levels of pre-processing. Extracting the skills required all the pre-processing tasks, but for the organisations it was necessary to keep the dashes. This meant we had to load the data multiple times and join the results at the end.
We can say that the model performs well on the given type of unstructured text (job descriptions). However, it is heavily dependent on the structure of these descriptions, so further development or improvement should focus on generalisation. As mentioned before, the pre-built models were not suitable for our kind of problem. Our own research showed some possibilities that could change that. Firstly, it is possible to chain different pre-built NER systems. However, this has two downsides with regard to our project: most NER systems are trained with the same dictionary and therefore lead to similar results, and systems such as spacyr and the Stanford CoreNLP are only partially available in R, so we would have had to switch to other languages such as Java or Python. Secondly, we could have worked with a much larger dictionary to train our own model, but this would have contradicted the task description, which demanded the use of only a small dictionary.
Furthermore, while we were able to extract company names from the texts, we did not have a chance to categorize them by industry. Identifying the top five industries that actively use data science would help us make better decisions when applying for jobs.