LinkedIn is a business-oriented social network where people build their professional profiles. This project analyzes the correlation between an individual’s industry and their skills. The ultimate goal is to build a job recommendation system based on a person’s LinkedIn profile. The first step, covered here, is to infer a person’s industry from their skills.
The code in GetOnlineData.R connects to the LinkedIn API. However, LinkedIn has restricted access to most of its API endpoints, so only a very limited number of profiles can be retrieved this way.
Fortunately, a LinkedIn data set was published on Reddit as JSON files. The code in GetData.R downloads the compressed data and unzips it into JSON files. The total volume of the data is 4.29 GB.
The code in CleanData.R reads the JSON files and converts them into data frames. Each data frame contains 10 columns; a few example rows follow.
## positions
## 1 <script type="text/javascript">window.location="http://google.com"</script>, <script type="text/javascript">window.location="http://google.com"</script>, 1990-01-01, TRUE, aaaaaasdddddddd
## 2 NULL
## 3 NULL
## 4 Processing Specialist, TRUE, University of Pittsburgh
## 5 Senior Network Engg, TRUE, Wipro Infotech Ltd
## 6 NULL
## public-profile-url location
## 1 /pub/union-select-1-a/b/b02/36b Albany, New York Area
## 2 /pub/a-a/6/788/620 Greece
## 3 /pub/f-cindy-porras-a/1b/65/893 Greater Atlanta Area
## 4 /pub/nigel-a/14/130/947 Greater Pittsburgh Area
## 5 /pub/psyec-jayakumar-a/10/8aa/575 Chennai Area, India
## 6 /pub/raoofat-a/36/5a6/416 Iran
## first-name num-connections last-name industry
## 1 '+union+select+1/* 0 a Animation
## 2 (A) 0 a Arts and Crafts
## 3 (F) Cindy Porras 3 A Insurance
## 4 (Nigel) 1 A Libraries
## 5 (PSYEC)Jayakumar 2 A Computer Networking
## 6 ) Raoofat 1 A <NA>
## educations skills summary
## 1 NULL NULL <NA>
## 2 skata, 1984-12-31, 1982-01-01 NULL <NA>
## 3 NULL NULL <NA>
## 4 NULL NULL <NA>
## 5 BE NULL <NA>
## 6 NULL NULL <NA>
Inside the code, the data set is cleaned by keeping only the rows that have values in all of the following columns: “location”, “positions”, “industry”, “skills” and “educations”. Since only “industry” and “skills” are of primary interest, the other columns are removed; “location”, “positions” and “educations” are kept for future analysis. A few rows of the cleaned data follow as an example.
## location
## 17 Melbourne Area, Australia
## 31 Saudi Arabia
## positions
## 17 Business development and client relationship management for nabInvest's fixed income investment partners - Antares Fixed Income, Peridiem Global Investors and Wiltshire Capital, Responsible for insitutional sales for nabInvest's specialist investment managers - specifically Antares Fixed Income and Fairview Equity Partners, Responsibility for client servicing and business development for Capital International in the institutional market in Australia and New Zealand, Responsible for fundamental research into economies and financial markets - primarily as inputs into asset allocation process, NA, Senior Investment Specialist, Senior Business Development Manager, Vice President, Manager Global Strategy, Head of Fixed Interest & Economics, 2010-09-01, 2008-04-01, 2001-07-01, 1998-09-01, 1996-04-01, TRUE, NA, NA, NA, NA, nabInvest, nabinvest, Capital National Alliance, National Australia Asset Management, Colonial Investment Management, NA, 2010-09-01, 2008-05-01, 2001-06-01, 1998-08-01
## 31 Gradute, In-Patient services, Out-Patient services, Pharmacist, Pharmacy trinee, Pharmacy trainee, 2007, 2011, 2010, TRUE, NA, NA, Not yet retired, Dr.Sulaiman Alhabib Medical Group, Riyadh Military Hospital, NA, 2011, 2010
## industry
## 17 Investment Management
## 31 Pharmaceuticals
## skills
## 17 Asset Management, Equities, Fixed Income, Asset Allocation, Investments
## 31 Typing, Research, Microsoft Office, Photoshop, Corporate Communications, Blogging, Graphic Design, Flash, Healthcare, Quality Assurance, Teaching, Teamwork, Technology, Logo Design, PowerPoint, Social Networking, Word, Writing, MS Office Suite, Arabic, Excel, English, Facebook, Adobe Acrobat, Books, InDesign
## educations
## 17 The Australian National University, 1986-12-31, 1983-01-01, BEc, Grad Dip Ec, Economics
## 31 King Saud University, 97 high school, 2012-12-31, 2007-12-31, 2007-01-01, 2004-01-01, Bachelor, high school degree, Pharmacy, major
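The row filter described before the example output can be sketched as follows. This is a minimal Python illustration, not the code in CleanData.R: the helper names `is_present` and `clean_profiles` and the toy records are hypothetical, and treating `None` and empty values as missing is an assumption.

```python
REQUIRED = ("location", "positions", "industry", "skills", "educations")


def is_present(value):
    """Assumption: None and empty strings/lists/dicts count as missing."""
    if value is None:
        return False
    if isinstance(value, (str, list, tuple, dict)) and len(value) == 0:
        return False
    return True


def clean_profiles(profiles, required=REQUIRED):
    """Keep profiles that have every required field, and keep only those fields."""
    cleaned = []
    for profile in profiles:
        if all(is_present(profile.get(col)) for col in required):
            cleaned.append({col: profile[col] for col in required})
    return cleaned


# Toy records, made up for illustration and loosely modeled on the rows above.
profiles = [
    {"location": "Greater Pittsburgh Area", "positions": ["Processing Specialist"],
     "industry": "Libraries", "skills": None, "educations": None,
     "num-connections": 1},
    {"location": "Saudi Arabia", "positions": ["Pharmacist"],
     "industry": "Pharmaceuticals", "skills": ["Typing", "Research"],
     "educations": ["King Saud University"], "num-connections": 5},
]
cleaned = clean_profiles(profiles)
print(len(cleaned), cleaned[0]["industry"])
```

Only the second toy record survives, because the first has no skills or educations; the surviving record keeps the five columns of interest and drops the rest.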
The code in AnalyzeIndustry.R selects the 20 industries that are most common among the individuals in the cleaned data set. Because computational resources are limited, the analysis is restricted to these top 20 industries.
As expected, the number of people working in “Information Technology and Services” is almost 5 times that of the other industries, since people in this industry are more likely to create a LinkedIn profile. Although this imbalance could be built into the model as a prior probability of an individual’s “industry”, we believe the adoption rate of LinkedIn will grow, especially in industries that are not information-related, so the current imbalance is not modeled.
To create a balanced data set with an equal number of people in each industry, the code in ExtractIndustryData.R randomly samples 2000 profiles from each industry. The code in SplitData.R then splits the resulting 40000 samples evenly into a training data set and a testing data set (50% each).
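The sampling and splitting steps can be sketched as below. This is a Python illustration under stated assumptions, not the code in ExtractIndustryData.R or SplitData.R: the function names are hypothetical, the toy data uses 4 profiles per industry instead of 2000, and the split is a plain shuffle (the text does not say whether the real split is stratified by industry).

```python
import random
from collections import defaultdict


def balanced_sample(profiles, per_industry, rng):
    """Draw the same number of profiles, without replacement, from each industry."""
    by_industry = defaultdict(list)
    for profile in profiles:
        by_industry[profile["industry"]].append(profile)
    sample = []
    for industry in sorted(by_industry):
        group = by_industry[industry]
        if len(group) < per_industry:
            raise ValueError(f"not enough profiles in {industry}")
        sample.extend(rng.sample(group, per_industry))
    return sample


def split_half(sample, rng):
    """Shuffle a copy of the sample and cut it into two halves."""
    shuffled = list(sample)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Toy data: 10 "Accounting" and 6 "Banking" profiles.
profiles = [{"id": i, "industry": "Accounting"} for i in range(10)]
profiles += [{"id": 100 + i, "industry": "Banking"} for i in range(6)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
sample = balanced_sample(profiles, per_industry=4, rng=rng)
train, test = split_half(sample, rng)
print(len(sample), len(train), len(test))
```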
The code in AnalyzeSkill.R extracts the top skills in each industry, ranked by how often a skill appears in the “skills” column of the training samples that belong to that industry. Following are the top 20 skills among the training samples in the “Accounting” industry.
Inevitably, skills across the top industries overlap; one example is “Banking” vs. “Financial Services”. Therefore, when the top 20 skills of each industry are aggregated into a single feature vector, its dimension is 227, much smaller than the 400 it would be if there were no overlap. Following are 10 example skills and their relationships to the 20 industries.
Here a binary value is assigned to each element of the feature vector: “1” indicates that the skill is required (it ranks among the top skills for the industry in the training data set) and “0” indicates that the skill may be irrelevant.
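The aggregation of per-industry top skills into one binary feature vector can be sketched as follows. This is a Python illustration rather than the code in AnalyzeSkill.R: the function names and toy profiles are hypothetical, the toy uses the top 2 skills instead of the top 20, and counting a skill at most once per profile is an assumption.

```python
from collections import Counter


def top_skills(profiles, k):
    """Return {industry: the k most frequent skills among its profiles}."""
    counts = {}
    for profile in profiles:
        # Assumption: a skill listed twice in one profile is counted once.
        unique = list(dict.fromkeys(profile["skills"]))
        counts.setdefault(profile["industry"], Counter()).update(unique)
    return {ind: [s for s, _ in c.most_common(k)] for ind, c in counts.items()}


def binary_features(top, features):
    """Mark each feature with 1 if it is a top skill of the industry, else 0."""
    return {ind: [1 if f in skills else 0 for f in features]
            for ind, skills in top.items()}


# Toy training profiles (made up for illustration).
train = [
    {"industry": "Accounting", "skills": ["Accounting", "Auditing"]},
    {"industry": "Accounting", "skills": ["Accounting", "Tax"]},
    {"industry": "Accounting", "skills": ["Accounting", "Auditing", "Excel"]},
    {"industry": "Banking", "skills": ["Banking", "Accounting"]},
    {"industry": "Banking", "skills": ["Banking", "Accounting"]},
    {"industry": "Banking", "skills": ["Banking", "Credit"]},
]

top = top_skills(train, k=2)
# The union of the per-industry lists is shorter than 2 * 2 because of overlap.
features = sorted(set().union(*top.values()))
matrix = binary_features(top, features)
print(features)
print(matrix)
```

In the toy data “Accounting” is a top skill of both industries, so the feature vector has 3 elements instead of 4, the same effect that reduces 400 to 227 in the real data.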
However, unlike “industry”, “skills” can be entered as free text, which leads to redundancy in the feature vector. For example, “accounting” and “financial accounting” are selected as two separate features but refer to much the same skill. Therefore, instead of using whole skills, keywords are extracted from the skills by breaking phrases into words.
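Breaking skill phrases into keywords can be sketched as below. This is a Python illustration with a hypothetical helper `skill_keywords`; the exact tokenization rule (lowercasing, and keeping letters, digits, `+` and `#` so that names such as “C++” stay recognizable) is an assumption, since the text only says that phrases are broken into words.

```python
import re


def skill_keywords(skills):
    """Lowercase each skill phrase and split it into word-like tokens."""
    words = []
    for phrase in skills:
        words.extend(re.findall(r"[a-z0-9+#]+", phrase.lower()))
    return words


keywords = skill_keywords(["Accounting", "Financial Accounting", "Microsoft Office"])
print(keywords)
```

After the split, “Accounting” and “Financial Accounting” both contribute the keyword `accounting`, which removes the redundancy described above.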
The code in AnalyzeKeyword.R selects the top keywords in each industry. Following are the top 20 keywords appearing in the “skills” column of the training samples that belong to the “Accounting” industry.
But some keywords, such as “management” and “development”, are widely used and not specific enough to correlate with any particular industry. Including such words in the feature vector is not very meaningful. So before the top 20 keywords are aggregated into the feature vector, the common keywords shared by all of the top 20 industries are removed.
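Removing the keywords shared by every industry amounts to subtracting a set intersection. The sketch below is a Python illustration with a hypothetical helper `drop_common` and made-up keyword lists for three industries instead of 20.

```python
def drop_common(top_keywords):
    """Remove keywords that appear in the top list of every industry.

    Returns the filtered lists and the set of removed keywords.
    """
    if not top_keywords:
        return {}, set()
    common = set.intersection(*(set(kws) for kws in top_keywords.values()))
    filtered = {ind: [k for k in kws if k not in common]
                for ind, kws in top_keywords.items()}
    return filtered, common


# Toy top-keyword lists (made up for illustration).
top_keywords = {
    "Accounting": ["accounting", "management", "tax"],
    "Banking": ["banking", "management", "credit"],
    "Marketing": ["marketing", "management", "social"],
}
filtered, common = drop_common(top_keywords)
print(common)
print(filtered)
```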
Instead of a binary value, each keyword is weighted by how often it appears in the “skills” column of the training samples that belong to the same industry. The weight is then normalized by dividing it by the sum of the frequencies of all the top keywords in that industry. Following is part of the feature vector that models the “Accounting” industry. Note that the weights in the feature vector characterizing a specific industry should always sum to 1.
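The normalization can be sketched as below: each weight is a keyword's frequency divided by the total frequency of the industry's top keywords. This is a Python illustration; the helper name `keyword_weights` and the frequencies are made up.

```python
def keyword_weights(freqs):
    """Divide each keyword frequency by the sum of all frequencies."""
    total = sum(freqs.values())
    if total == 0:
        raise ValueError("cannot normalize an all-zero frequency table")
    return {keyword: count / total for keyword, count in freqs.items()}


# Toy frequencies for one industry (made up for illustration).
weights = keyword_weights({"accounting": 50, "tax": 30, "audit": 20})
print(weights)
print(sum(weights.values()))
```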
To build an item-based recommendation system, the same feature vector that characterizes the individual industries is used to characterize a person. For the person, however, binary values are assigned regardless of whether the elements of the feature vector represent skills or keywords. The cosine similarity between the feature vector of the person and that of each industry is calculated, and the industry with the largest cosine similarity is recommended. The code in ClassifySubject.R tests the recommendation system on the training data set, and the following shows the accuracy using either skills or keywords.
Note that the feature length refers to the number of top skills or keywords extracted from each industry. Since there are overlaps across the 20 industries, the actual length of the feature vector cannot be obtained by simply multiplying the feature length by 20. In addition, “keyword” outperforms “skill” across the whole range of feature lengths, and the accuracy is still increasing as the number of features grows.
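The item-based matching step described above can be sketched as follows: the person gets a binary vector over the shared features, each industry gets its weighted vector, and the industry with the highest cosine value wins. This is a Python illustration, not the code in ClassifySubject.R; the function names and toy weights are hypothetical, and returning `None` (an “unknown” prediction) when a person shares no feature with any industry is an assumption about how such cases are handled.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, or None if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return None
    return dot / (norm_a * norm_b)


def recommend(person_keywords, industry_weights, features):
    """Return the best-matching industry, or None when nothing overlaps."""
    person = [1 if f in person_keywords else 0 for f in features]
    best, best_score = None, 0.0
    for industry, weights in industry_weights.items():
        vector = [weights.get(f, 0.0) for f in features]
        score = cosine_similarity(person, vector)
        if score is not None and score > best_score:
            best, best_score = industry, score
    return best


# Toy feature set and industry weights (made up for illustration).
features = ["accounting", "audit", "banking", "credit"]
industry_weights = {
    "Accounting": {"accounting": 0.6, "audit": 0.4},
    "Banking": {"banking": 0.7, "credit": 0.3},
}
match = recommend({"audit", "accounting", "excel"}, industry_weights, features)
unknown = recommend({"photoshop"}, industry_weights, features)
print(match, unknown)
```

The second call illustrates why short feature vectors hurt: a profile whose keywords are all outside the feature set cannot be matched to any industry.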
Supervised learning algorithms are then applied to the training data set without modeling the individual industries beforehand. The code in IndustryClassifier.R tries different supervised learning algorithms on the training data set, and the following shows the accuracy obtained by testing on the training data set itself with the feature length set to 20.
Among the four supervised learning algorithms, the linear support vector machine performs best. So feature vectors of various lengths are fed into this algorithm to evaluate its performance on the training data set.
Evidently, keyword-based features outperform skill-based features, so only keyword-based features are used to analyze the testing data set. Both the item-based recommendation system (ClassifyTesting.R) and the linear support vector machine (TestClassifier.R) are applied to infer an individual’s industry. Following is the result.
Although the linear SVM works better on the training data set, especially as the dimension of the feature vector increases, it overfits: its performance deteriorates on the testing data set. The item-based recommendation system, on the other hand, is stable across the training and testing data sets. Moreover, its performance increases with the dimension of the feature vector and is expected to improve further with a longer feature vector. One explanation is that a short feature vector may fail to include the keywords in an individual’s profile; the longer the feature vector, the more likely it is that an individual’s keywords are included. The following figure shows the relationship between the dimension of the feature vector and the number of unknown predictions.
However, the number of unknown cases increases slightly when the dimension of the feature vector increases from 160 to 320. This is because more common words are removed. But since the accuracy is still increasing, longer feature vectors are still preferred.
The accuracy of 0.4832 achieved by the item-based recommendation system with 320 keywords describing each industry is not satisfactory. To diagnose the problem, we plot the confusion matrix of the classification result.
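A confusion matrix of this kind can be tallied as sketched below. This is a Python illustration with a hypothetical helper `confusion_matrix` and made-up labels; normalizing each row to proportions is an assumption, and predictions are assumed to be drawn from the same label list as the actual industries.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Row i, column j: share of profiles in labels[i] predicted as labels[j]."""
    counts = Counter(zip(actual, predicted))
    matrix = []
    for a in labels:
        row_total = sum(counts[(a, p)] for p in labels)
        matrix.append([counts[(a, p)] / row_total if row_total else 0.0
                       for p in labels])
    return matrix


# Toy labels (made up for illustration).
labels = ["Banking", "Financial Services"]
actual = ["Banking"] * 4 + ["Financial Services"] * 2
predicted = ["Banking", "Banking", "Banking", "Financial Services",
             "Financial Services", "Financial Services"]
matrix = confusion_matrix(actual, predicted, labels)
print(matrix)
```

When such a matrix is drawn as a heat map, the brightness of each square corresponds to these row proportions.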
Here we want all the bright colors to concentrate along the diagonal, but we observe some bright squares off the diagonal. For example, a relatively large proportion of “Banking” profiles are classified as “Financial Services”. This makes sense, as the two areas require quite similar skills and people can easily switch from one industry to the other. Another example is “Computer Software” and “Information Technology and Services”. In a job recommendation system we want to give users more options, so listing all the job opportunities in the industries that match an individual’s skills is actually desirable. Therefore, we first group the industries. The code in GroupIndustry.R groups the 20 industries into 17 groups. Following is the dendrogram of the industries, based on the distance between the feature vectors of any two industries.
Here “Banking” and “Financial Services”, “Computer Software” and “Information Technology and Services”, and “Educational Management” and “Higher Education” are grouped together. With the industry groups, the classification accuracy is recalculated, and a small improvement can be observed in the following figure.