Company Ratings
Company Reviews
As a group of 4th years graduating this semester and a 3rd year looking for a job next year, searching for potential companies is not something that is new to us. As most people would prefer, we want to work somewhere incredible – or at the very least above average.
The dataset was scraped from Indeed.com and contains information about companies, including their employees’ ratings and happiness, location, revenue, salaries, and other useful details.
Question: Can we predict the rating of a company based on general information and employee surveys?
Exploratory Data Analysis
First, let’s look at what our data looks like.
Rating, our target variable, describes the overall rating of companies based on external and internal rankings by employees and external industry standards. It is a continuous variable that theoretically ranges from 1-5, but in practice ranges from 1.6 to 5 with a median of 3.5 and a mean of 3.476; the distribution of these ratings is fairly Normal.
Aside from rating (our target variable), we have 15 variables.
reviews: number of reviews left
ceo_approval: 0-1 value describing a % approval rating
ceo_count: number of reviews that were left on the CEO
interview_experience: categorical variable; applicants describe their experience as average, favorable, or unfavorable
interview_difficulty: categorical variable; applicants describe their interviews as easy, medium, or difficult
interview_duration: length of the application process, ranging from 1-2 days to over 1 month
interview_count: number of interview reviews that were left
employees: number of employees in the company; ranges from 1 to 10,000+
industry: industry category
revenue: company revenue; ranges from <$1M to >$10B
The following five variables are based on employee ratings on a scale of 1-5 and have a Normal distribution:
wl_balance
benefits
security
management
culture
Variable correlation matrix
As demonstrated by the graph above, the individual variables that seem to have the highest association with rating are ceo_count, interview_count, and reviews.
Now, we can calculate the performance of our companies! First, let’s find the average tech rating.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.100 3.200 3.500 3.501 3.800 4.900
The companies we will be employed at this summer or after graduation have the following ratings:
Medtronic: 3.9
PwC: 4.0
Oracle: 3.8
All of these are above average – to the point that they are at or above the 3rd quartile. Now, however, we want to see what features could best predict high ratings for our and other companies.
Data Analysis
Random Forest
We used a random forest to see whether we could predict which companies had a high rating, using 3.5 as the cutoff.
High ratings were coded as 1 and low ratings as 0.
Prevalence:
#Positive class/Total number of Cases
5208/(4566+5208)
## [1] 0.5328422
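The coding step and the prevalence calculation can be sketched as follows. This is a minimal Python illustration, not the code used for the report: the function names and the small example list are hypothetical, and treating a rating of exactly 3.5 as high is an assumption, since the report does not say which side the cutoff falls on.

```python
def label_high(ratings, cutoff=3.5):
    """Code each rating as 1 (high) or 0 (low).

    Assumption: ratings at or above the cutoff count as high.
    """
    return [1 if r >= cutoff else 0 for r in ratings]


def prevalence(labels):
    """Share of the positive class among all cases."""
    return sum(labels) / len(labels)


# Hypothetical ratings, for illustration only
example = [2.9, 3.5, 4.1, 3.2, 3.8]
example_labels = label_high(example)
example_prevalence = prevalence(example_labels)

# The class counts reported above reproduce the prevalence figure
reported_prevalence = 5208 / (4566 + 5208)
print(round(reported_prevalence, 7))
```

A prevalence close to 0.5 means the two classes are nearly balanced, so accuracy is a reasonable headline metric here.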
Since we have about 10,000 observations, we decided on an 80/10/10 partition to train, tune, and test our model.
Dimensions of the sets:
Training set:
## [1] 7820 16
Testing set:
## [1] 976 16
Tuning set:
## [1] 978 16
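An 80/10/10 partition of this kind can be sketched in a few lines. The Python below is only an illustration under stated assumptions: `split_80_10_10` is a hypothetical helper, the split is a plain random shuffle, and row indices stand in for the actual data. The set sizes it produces differ slightly from those reported above, which presumably come from a different (possibly stratified) partitioning routine.

```python
import random


def split_80_10_10(rows, seed=0):
    """Shuffle rows and split them into train, tune, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_tune = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    test = shuffled[n_train + n_tune:]
    return train, tune, test


# 9,774 row indices stand in for the companies in the dataset
train, tune, test = split_80_10_10(range(9774))
sizes = (len(train), len(tune), len(test))
print(sizes)
```

Fixing the seed keeps the partition reproducible, so tuning decisions and the final test score always refer to the same rows.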
For the random forest, we optimized 3 inputs.
The first one is mtry, the number of variables randomly sampled at each split.
## mtry OOBError
## 3.OOB 3 0.07276215
## 5.OOB 5 0.07327366
## 10.OOB 10 0.07327366
mtry was chosen to be 3, which had the lowest out-of-bag (OOB) error.
Then, we optimized ntree, the number of trees grown.
The ntree value was chosen to be 580
We also optimized the number of samples drawn for each tree by trying values from 100 to 900, and chose 100 as the best option.
Lastly, we predicted the values with our model and compared them to the test set.
We used the confusion matrix and accuracy to evaluate our model.
## Confusion Matrix and Statistics
##
## Actual
## Prediction less more
## less 420 35
## more 36 485
##
## Accuracy : 0.9273
## 95% CI : (0.9091, 0.9428)
## No Information Rate : 0.5328
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8539
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9327
## Specificity : 0.9211
## Pos Pred Value : 0.9309
## Neg Pred Value : 0.9231
## Precision : 0.9309
## Recall : 0.9327
## F1 : 0.9318
## Prevalence : 0.5328
## Detection Rate : 0.4969
## Detection Prevalence : 0.5338
## Balanced Accuracy : 0.9269
##
## 'Positive' Class : more
##
Our accuracy is about 92.7%, which is roughly 39 percentage points better than the no-information rate of 53.3% (always guessing the majority class).
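The headline statistics can be recomputed by hand from the four cell counts of the confusion matrix. The Python sketch below is an independent check, not the code behind the report; the variable names are illustrative, and "more" is taken as the positive class, as stated in the output.

```python
# Cell counts from the confusion matrix above ("more" is the positive class)
tp = 485  # predicted more, actually more
fp = 36   # predicted more, actually less
fn = 35   # predicted less, actually more
tn = 420  # predicted less, actually less
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # also called recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)     # also called positive predictive value
f1 = 2 * precision * sensitivity / (precision + sensitivity)

# No-information rate: accuracy of always predicting the larger actual class
no_information_rate = max(tp + fn, tn + fp) / total

# Cohen's kappa: agreement beyond what the marginal totals predict by chance
expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

print(f"accuracy={accuracy:.4f} kappa={kappa:.4f} f1={f1:.4f}")
print(f"gain over no-information rate: {accuracy - no_information_rate:.3f}")
```

Comparing accuracy against the no-information rate, rather than against 50%, is what makes the "better than guessing" claim precise when the classes are not perfectly balanced.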
We also wanted to determine which features contributed the most to this rating
Clustering
First, we ran the clustering algorithm with 2 centers using most of the numeric variables in our data. To determine how well the clusters separated the data, we divided the between-cluster sum of squares by the total sum of squares and got 74.6%, meaning the two clusters account for about three quarters of the total variance.
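The ratio of between-cluster to total sum of squares can be computed directly from the points and their cluster labels. The following Python sketch shows the arithmetic on a hypothetical toy example; `ss_ratio`, the points, and the labels are all made up for illustration and are unrelated to the report's data.

```python
def ss_ratio(points, labels):
    """Return between-cluster SS divided by total SS for labeled points."""
    n_dims = len(points[0])
    grand_mean = [sum(p[d] for p in points) / len(points) for d in range(n_dims)]
    total_ss = sum((p[d] - grand_mean[d]) ** 2 for p in points for d in range(n_dims))

    within_ss = 0.0
    for cluster in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        center = [sum(p[d] for p in members) / len(members) for d in range(n_dims)]
        within_ss += sum((p[d] - center[d]) ** 2 for p in members for d in range(n_dims))

    # total SS = within SS + between SS, so between SS is the remainder
    return (total_ss - within_ss) / total_ss


# Hypothetical points forming two tight, well-separated groups
toy_points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.0)]
toy_labels = [0, 0, 1, 1]
toy_ratio = ss_ratio(toy_points, toy_labels)
print(round(toy_ratio, 4))
```

A ratio near 1 means almost all variation lies between clusters; a ratio near 0 means the clustering explains very little.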
Elbow Graph
Next, we made an elbow graph to check whether 2 clusters was the best option. As you can see, it was: 2 sits right at the elbow point, where adding more clusters brings only small further reductions in within-cluster variance.
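The idea behind an elbow graph can be sketched with a small, deterministic k-means on one-dimensional data. This Python example is an illustration under stated assumptions, not the clustering code used in the report: `kmeans_1d`, its quantile-style initialization, and the toy values are hypothetical.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain Lloyd's k-means on 1-D data; returns the within-cluster SS."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: spread the initial centers across the sorted data
    centers = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Keep the old center if a cluster ends up empty
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    # Squared distance from each value to its nearest final center
    return sum(min((v - c) ** 2 for c in centers) for v in values)


# Hypothetical values with two obvious groups, around 1 and around 5
toy_values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
wss_by_k = {k: kmeans_1d(toy_values, k) for k in (1, 2, 3)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

In this toy case the within-cluster sum of squares collapses when going from 1 to 2 clusters and barely moves after that, which is the elbow shape described above.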
3D Plot with Management
After that, we created some 3D plots in order to explore our clustered data. Here we used culture and CEO approval as the x and y axes, as these two variables had a high correlation with overall rating. On the z axis is the management rating and the color corresponds to a low or high overall rating.
Here, we see that high values for culture and CEO approval typically go with a higher overall rating, no matter the cluster. The instances where we saw a low overall rating despite a high culture rating and CEO approval were typically near our cutoff of 3.5.
3D Plot with Industry
Finally, we created a 3D plot with industry on the z axis.
## rating
## industry less more
## Tech 338 458
## Other 776 792
## Entertainment 323 384
## ManufacturingandConstruction 650 565
## Business 368 473
## Retail 416 398
## Education 118 605
## Education and Schools 0 0
## Financial Services\nInformation Technology 0 0
## Healthcare 1216 1160
## Human Resources and Staffing 0 0
## Information Technology\nHealthcare 0 0
## Insurance\nHealthcare 0 0
## Pharmaceuticals 0 0
## Restaurants, Travel and Leisure\nConsumer Goods and Services 0 0
## Restaurants, Travel and Leisure\nRestaurants, Travel and Leisure 0 0
## Unknown 361 373
Here, we see again that high values for culture and CEO approval typically go with a higher overall rating, no matter the cluster. Education also stands out: 605 of its 723 companies (about 84%) have a high rating, whereas in every other industry the share of high ratings falls between roughly 46% and 58%.
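The per-industry comparison can be made concrete by turning the counts in the table into shares of high ratings. The Python sketch below copies the nine non-empty rows from the table; the dictionary layout and variable names are illustrative choices.

```python
# (low, high) counts for the nine collapsed industries, from the table above
counts = {
    "Tech": (338, 458),
    "Other": (776, 792),
    "Entertainment": (323, 384),
    "ManufacturingandConstruction": (650, 565),
    "Business": (368, 473),
    "Retail": (416, 398),
    "Education": (118, 605),
    "Healthcare": (1216, 1160),
    "Unknown": (361, 373),
}

# Share of companies rated high within each industry
high_share = {name: high / (low + high) for name, (low, high) in counts.items()}

for name, share in sorted(high_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:30s} {share:.1%}")
```

Working with shares rather than raw counts keeps large industries such as Healthcare from dominating the comparison.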
Comparing the Random Forest to Clustering
In our Random Forest model, we obtained an accuracy of around 92.7% with 580 trees, an mtry of 3, and a sample size of 100. Sensitivity (0.9327) and specificity (0.9211) were also high, showing that the model predicts whether a company’s rating is high or low quite well.
In our clustering plots, we saw that high values for culture and CEO approval typically go with a high rating.
Furthermore, the correlation matrix, the variable importance plot for our Random Forest, and our first 3D plot gave consistent results about which variables typically corresponded to a high rating: culture, management, job security, and CEO approval.
Limitations
Some limitations of our project were:
1.) There is concern about the accuracy of the ratings, as they came from Indeed.com and are anonymous. We also suspect that the people who left ratings were those with really strong feelings about the company (either really positive or really negative).
2.) We collapsed the industries ourselves (from over 40 levels to just 9) based on our own interpretation of which categories went together, so someone who grouped the industries differently could find different results.
Future Work
We had to remove several columns from our data, pertaining to things like locations, salary, and job positions, due to formatting issues and a lack of cohesion within these columns. So, in the future we’d want to find data that was more standardized with these types of variables in order to allow us to include them in the models. Additionally, we’d want more information on the reviewer, like their demographic information, in order to see if the ratings of a company were consistent among different social groups.