Company Ratings

Company Reviews

We are a group of 4th years graduating this semester and a 3rd year looking for a job next year, so searching for potential employers is nothing new to us. Like most people, we want to work somewhere incredible, or at the very least above average.

The data in this dataset was scraped from Indeed.com. It covers companies and their employees’ ratings and happiness, along with location, revenue, salaries, and a lot of other useful information.

Question: Can we predict the rating of a company based on general information and employee surveys?

Exploratory Data Analysis

First, let’s look at what our data looks like.

Rating, our target variable, describes the overall rating of companies based on external and internal rankings by employees and external industry standards. It is a continuous variable that theoretically ranges from 1-5, but in practice ranges from 1.6 to 5 with a median of 3.5 and a mean of 3.476; the distribution of these ratings is fairly Normal.

Aside from rating (our target variable), we have 15 variables.

reviews: number of reviews left

ceo_approval: 0-1 value describing a % approval rating

ceo_count: number of reviews that were left on the CEO

interview_experience: categorical variable; applicants describe their experience as average, favorable, or unfavorable

interview_difficulty: categorical variable; applicants describe their interviews as easy, medium, or difficult

interview_duration: length of the application process, ranging from 1-2 days to over 1 month

interview_count: number of interview reviews that were left

employees: number of employees in the company; ranges from 1 to 10,000+

industry: industry category

revenue: company revenue; ranges from <$1M to >$10B

The following five variables are based on employee ratings on a scale of 1-5 and have a Normal distribution:

wl_balance

benefits

security

management

culture

Variable correlation matrix

As demonstrated by the graph above, the individual variables that seem to have the highest association with rating are ceo_count, interview_count, and reviews.
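For readers who want to see what sits behind a correlation matrix like this one, here is a minimal Python sketch of the Pearson correlation applied pairwise to a few columns. It is only an illustration: the `toy` values are hypothetical and not taken from the Indeed data, and the helper name `pearson` is our own.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy values, NOT taken from the Indeed dataset.
toy = {
    "rating":  [3.1, 3.5, 3.9, 4.2, 2.8],
    "culture": [3.0, 3.4, 4.0, 4.1, 2.9],
    "reviews": [120, 15, 300, 42, 87],
}

# Build the full matrix as a nested dict: corr_matrix[a][b].
names = list(toy)
corr_matrix = {a: {b: pearson(toy[a], toy[b]) for b in names} for a in names}

for a in names:
    print(a.ljust(8), "  ".join(f"{corr_matrix[a][b]:6.2f}" for b in names))
```

Each cell is the correlation between the row variable and the column variable, so the diagonal is 1 and the matrix is symmetric.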

Now, we can calculate the performance of our companies! First, let’s find the average tech rating.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.100   3.200   3.500   3.501   3.800   4.900

The companies we will be employed at this summer or after graduation have the following ratings:

Medtronic: 3.9

PwC: 4.0

Oracle: 3.8

All of these are above average – to the point that they are at or above the 3rd quartile. Now, however, we want to see what features could best predict high ratings for our and other companies.

Data Analysis

Random Forest

We decided to use a Random Forest to see if we could predict which companies had a high rating, using 3.5 as the cutoff.

High ratings were coded as 1 and low ratings as 0.

Prevalence:

#Positive class/Total number of Cases
5208/(4566+5208)
## [1] 0.5328422
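The labelling and prevalence steps can be sketched in a few lines of Python. This is an illustration under assumptions, not the code behind the output above: we assume a rating exactly at the cutoff counts as high, the function names are our own, and `toy_ratings` is made up.

```python
CUTOFF = 3.5  # cutoff stated in the text

def label_rating(rating, cutoff=CUTOFF):
    """Return 1 for a high rating and 0 for a low one.

    Assumption: a rating exactly at the cutoff counts as high.
    """
    return 1 if rating >= cutoff else 0

def prevalence(labels):
    """Share of observations in the positive (high-rating) class."""
    return sum(labels) / len(labels)

# Hypothetical ratings, for illustration only.
toy_ratings = [3.9, 4.0, 3.8, 3.2, 2.7, 3.5]
toy_labels = [label_rating(r) for r in toy_ratings]
print(toy_labels, prevalence(toy_labels))

# The class counts reported above reproduce the printed prevalence.
reported = 5208 / (4566 + 5208)
print(round(reported, 7))
```

A prevalence close to 50% means the two classes are roughly balanced, which is why plain accuracy is a reasonable headline metric later on.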

Since we have about 10,000 observations, we decided to use an 80/10/10 partition to train, tune, and test our model.

Dimensions of the sets:

Training set:

## [1] 7820   16

Testing set:

## [1] 976  16

Tuning set:

## [1] 978  16
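An 80/10/10 partition can be sketched in Python as a shuffle followed by two cuts. This is a simplified, unstratified version with a hypothetical helper name and seed. The set sizes it produces will be close to, but not necessarily identical to, the dimensions reported above, which may come from a stratified split.

```python
import random

def split_indices(n, fractions=(0.8, 0.1, 0.1), seed=42):
    """Shuffle row indices 0..n-1 and cut them into train/tune/test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * fractions[0])
    n_tune = round(n * fractions[1])
    train = indices[:n_train]
    tune = indices[n_train:n_train + n_tune]
    test = indices[n_train + n_tune:]  # whatever is left goes to the test set
    return train, tune, test

n_rows = 7820 + 976 + 978  # total rows implied by the set dimensions above
train_idx, tune_idx, test_idx = split_indices(n_rows)
print(len(train_idx), len(tune_idx), len(test_idx))
```

Keeping the tuning set separate from the test set means the hyperparameters are never chosen using the data that produces the final accuracy figure.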

To fit the random forest, we decided to optimize 3 inputs.

The first one is mtry, the number of variables randomly sampled at each split.

##        mtry   OOBError
## 3.OOB     3 0.07276215
## 5.OOB     5 0.07327366
## 10.OOB   10 0.07327366

mtry was chosen to be 3
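The selection rule here is simply "take the candidate with the lowest out-of-bag error". A tiny Python sketch, using the three values from the table above (the variable names are our own):

```python
# OOB error for each candidate mtry, copied from the table above.
oob_error = {3: 0.07276215, 5: 0.07327366, 10: 0.07327366}

# Pick the candidate with the lowest OOB error.
# If two candidates tie, min() keeps the one listed first.
best_mtry = min(oob_error, key=oob_error.get)
print(best_mtry, oob_error[best_mtry])
```

Note that the differences between candidates are small (about 0.0005), so the model is not very sensitive to this setting.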

Then, we tried to optimize ntree, the number of trees grown.

The ntree value was chosen to be 580.

We also optimized the number of samples drawn for each tree by trying different values from 100 to 900, and chose 100 as the best option.

Lastly, we predicted the values for the test set with our model and compared them to the actual values.

We decided to use the confusion matrix and accuracy to evaluate our model.

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction less more
##       less  420   35
##       more   36  485
##                                           
##                Accuracy : 0.9273          
##                  95% CI : (0.9091, 0.9428)
##     No Information Rate : 0.5328          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8539          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9327          
##             Specificity : 0.9211          
##          Pos Pred Value : 0.9309          
##          Neg Pred Value : 0.9231          
##               Precision : 0.9309          
##                  Recall : 0.9327          
##                      F1 : 0.9318          
##              Prevalence : 0.5328          
##          Detection Rate : 0.4969          
##    Detection Prevalence : 0.5338          
##       Balanced Accuracy : 0.9269          
##                                           
##        'Positive' Class : more            
## 

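Our accuracy is about 93%, which is roughly 39 percentage points better than the no-information rate of 53% (what we would get by always guessing the more common class).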
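The headline statistics can be recomputed directly from the four counts in the confusion matrix. The Python sketch below does this by hand as a sanity check. It is not the code that produced the output above, and the variable names are our own.

```python
# Counts from the confusion matrix above; "more" is the positive class.
tn, fn = 420, 35   # predicted "less": actually less, actually more
fp, tp = 36, 485   # predicted "more": actually less, actually more
n = tn + fn + fp + tp

accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)   # recall for the "more" class
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
no_info_rate = max(tp + fn, tn + fp) / n   # always predict the larger class

# Cohen's kappa: agreement beyond what the row/column totals alone would give.
expected = ((tn + fn) * (tn + fp) + (fp + tp) * (fn + tp)) / n ** 2
kappa = (accuracy - expected) / (1 - expected)

for name, value in [
    ("Accuracy", accuracy),
    ("NIR", no_info_rate),
    ("Kappa", kappa),
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Precision", precision),
    ("F1", f1),
]:
    print(f"{name:<12} {value:.4f}")
```

The gap between accuracy and the no-information rate (about 0.39) is the improvement over always guessing the majority class.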

We also wanted to determine which features contributed the most to this rating

Clustering

First, we ran the clustering algorithm with 2 centers using most of the numeric variables in our data. To determine how well our clusters performed, we divided the between-cluster sum of squares by the total sum of squares and got 74.6%, meaning the two clusters account for about three quarters of the total variation, which is not bad at all.
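The between/total ratio can be computed by hand once each point has a cluster label. Here is a minimal Python sketch on made-up points. The helper names and `toy_points` are hypothetical, and it relies on the identity that the between-cluster sum of squares equals the total minus the within-cluster sum of squares.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def between_over_total(points, labels):
    """Fraction of the total sum of squares explained by the clustering."""
    grand = centroid(points)
    total_ss = sum(sq_dist(p, grand) for p in points)
    within_ss = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        c = centroid(members)
        within_ss += sum(sq_dist(p, c) for p in members)
    # between SS = total SS - within SS
    return (total_ss - within_ss) / total_ss

# Hypothetical toy data: two tight, well-separated groups.
toy_points = [(0, 0), (0, 2), (10, 0), (10, 2)]
toy_labels = [0, 0, 1, 1]
ratio = between_over_total(toy_points, toy_labels)
print(f"{ratio:.1%}")
```

A ratio near 100% means almost all of the variation is between clusters rather than inside them. A ratio near 0% means the clustering explains very little.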

Elbow Graph

Next, we decided to make an elbow graph to determine if 2 clusters was the best option. As you can see, it was: 2 is right at the elbow point, after which adding more clusters reduces the within-cluster variance only slightly.
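The numbers behind an elbow graph come from running k-means for several values of k and recording the total within-cluster sum of squares each time. The Python sketch below uses a bare-bones Lloyd's algorithm on made-up points. It is an assumption-laden illustration: a single random start, a hypothetical function name `kmeans_wss`, and toy data, not the procedure used for the graph above.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans_wss(points, k, seed=0, iterations=50):
    """Run a basic Lloyd's k-means and return the total within-cluster SS."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to its cluster mean (keep it if empty).
        new_centers = [mean_point(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return sum(sq_dist(p, centers[j])
               for j, c in enumerate(clusters) for p in c)

# Hypothetical toy data: two blobs, so the elbow should appear at k = 2.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.3),
              (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.3)]

wss_by_k = {k: kmeans_wss(toy_points, k) for k in range(1, 6)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 3))
```

Plotting `wss_by_k` against k gives the elbow graph. The elbow is the k after which the curve flattens out.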

3D Plot with Management

After that, we created some 3D plots in order to explore our clustered data. Here we used culture and CEO approval as the x and y axes, as these two variables had a high correlation with overall rating. On the z axis is the management rating and the color corresponds to a low or high overall rating.

Here, we see that high values for culture and CEO approval are typically associated with a higher overall rating no matter the cluster. The instances where we saw a low overall rating, despite a high culture rating and CEO approval, were typically near our cutoff of 3.5.

3D Plot with Industry

Finally, we created a 3D plot with industry on the z axis.

##                                                                   rating
## industry                                                           less more
##   Tech                                                              338  458
##   Other                                                             776  792
##   Entertainment                                                     323  384
##   ManufacturingandConstruction                                      650  565
##   Business                                                          368  473
##   Retail                                                            416  398
##   Education                                                         118  605
##   Education and Schools                                               0    0
##   Financial Services\nInformation Technology                          0    0
##   Healthcare                                                       1216 1160
##   Human Resources and Staffing                                        0    0
##   Information Technology\nHealthcare                                  0    0
##   Insurance\nHealthcare                                               0    0
##   Pharmaceuticals                                                     0    0
##   Restaurants, Travel and Leisure\nConsumer Goods and Services        0    0
##   Restaurants, Travel and Leisure\nRestaurants, Travel and Leisure    0    0
##   Unknown                                                           361  373

Here, we see again that high values for culture and CEO approval are typically associated with a higher overall rating no matter the cluster. Also, Education has by far the largest gap between its number of high ratings and low ones (605 versus 118), whereas all other industries were split much more evenly between the two.

Comparing the Random Forest to Clustering

In our Random Forest model, we were able to get quite an accurate model (around 93%) with 580 trees, an mtry of 3, and a sample size of 100, the values chosen in the tuning steps above. Additionally, we saw some great values for Sensitivity (0.9327) and Specificity (0.9211), showing that our model was able to predict whether a company’s rating was high or low quite well.

In our clustering plots, we were able to see that high values for culture and CEO approval typically go along with a high rating.

Furthermore, the correlation matrix, the variable importance plot for our Random Forest, and our 1st 3D plot gave consistent results about which variables typically corresponded to a high rating: culture, management, job security, and CEO approval.

Limitations

Some limitations to our project were that:

1.) There is concern about the accuracy of the ratings, as they came from Indeed.com and are anonymous. We suspect that the people who left ratings were those who have really strong feelings about the company (either really positive or really negative), so the ratings may not reflect the typical employee.

2.) We collapsed the industries ourselves (from over 40 levels to just 9) based on our own interpretation of which categories went together, so someone who grouped the industries differently could find different results.

Future Work

We had to remove several columns from our data, pertaining to things like locations, salary, and job positions, due to formatting issues and a lack of cohesion within these columns. So, in the future we’d want to find data that was more standardized with these types of variables in order to allow us to include them in the models. Additionally, we’d want more information on the reviewer, like their demographic information, in order to see if the ratings of a company were consistent among different social groups.