1 Placement Prediction - DA Cohort-1

In this analysis, we analyse data on 257 Upgrad Cohort-1 students, of whom about 40 were placed (either a new job or an internal transition into analytics).

The objective of the analysis is to identify the important variables which affect the likelihood of a student getting placed and build a (classification) model to predict the same.

Note: The data at hand is quite small (~250 observations) for predictive modeling, so the model may not generalise well to unseen data in the future. This has been considered in the choice of model as well.

1.1 Data Understanding and Cleaning

We have data from three sources - 1. Application data (education, company, years of experience etc.), 2. Salary (Current CTC) and 3. Placement (whether placed or not).

This section imports and cleans the three datasets; you can skip it and move to Exploratory Data Analysis.

1.2 Exploratory Data Analysis

Let’s identify the main variables which indicate differences between the placed and the unplaced students.

Among the 257 students, 40 (about 16%) were placed. Let’s compare the two groups of students (encoding: job = 1, no job = 0).

## # A tibble: 2 x 2
##      job median(AGE)
##   <fctr>       <dbl>
## 1      0          33
## 2      1          26

The median age of placed students is 26, against 33 for unplaced students. There is also a clear difference in total years of experience (a median of 3 years for placed students against 7 for unplaced students) and in current CTC.

## # A tibble: 2 x 2
##      job median(Total.Experience....yrs.)
##   <fctr>                            <dbl>
## 1      0                                7
## 2      1                                3

The median current CTC of placed students is 6 lacs, against 10.5 lacs for unplaced students.

## # A tibble: 2 x 2
##      job median(Current.CTC, na.rm = T)
##   <fctr>                          <dbl>
## 1      0                        1050000
## 2      1                         600000

The median years since graduation is 4.16 for placed students, against 9.08 for unplaced students.

## # A tibble: 2 x 2
##      job median(years_since_passout_1)
##   <fctr>                         <dbl>
## 1      0                          9.08
## 2      1                          4.16

One can conduct similar analysis on other variables as well, though that has been omitted in this report for brevity.
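The per-group medians shown above can be reproduced with a few lines of code. The sketch below is a minimal Python illustration, not the code used for this report: the function name `median_by_group` and the sample rows are made up, and only the column names `job` and `AGE` come from the tables above. Missing values are skipped, mirroring the `na.rm = T` used for Current CTC.

```python
from collections import defaultdict
from statistics import median


def median_by_group(rows, group_key, value_key):
    """Return {group: median of value_key}, skipping missing (None) values."""
    groups = defaultdict(list)
    for row in rows:
        value = row[value_key]
        if value is not None:
            groups[row[group_key]].append(value)
    return {group: median(values) for group, values in sorted(groups.items())}


# Made-up rows for illustration only; these are NOT the cohort data.
students = [
    {"job": 0, "AGE": 35}, {"job": 0, "AGE": 31}, {"job": 0, "AGE": 33},
    {"job": 1, "AGE": 25}, {"job": 1, "AGE": 27}, {"job": 1, "AGE": None},
]

age_medians = median_by_group(students, "job", "AGE")
print(age_medians)
```

The same function can be pointed at any other numeric column (experience, CTC, years since passout) by changing `value_key`.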

1.3 Modeling

Let’s now build some classification models to predict the two classes - job/no-job (or 1/0).

Considering that the data available is limited and that we want the model to be easily interpretable, let’s try decision trees first.

Reading the tree is simple - go left if the condition is true, else right. This tree says that if the total work experience is greater than 3.1 years, the chance of getting placed is low. If it is less than 3.1 years and the ‘Profile Score’ is less than 0.52, the chances of getting placed are good.

Note: Variables such as ‘Profile Score’, ‘Company Score’ etc. were derived from the original variables, though they often gave counter-intuitive results (as in this case). In the final model, we may choose not to use any of the derived metrics.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 61  8
##          1  4  4
##                                           
##                Accuracy : 0.8442          
##                  95% CI : (0.7436, 0.9168)
##     No Information Rate : 0.8442          
##     P-Value [Acc > NIR] : 0.5762          
##                                           
##                   Kappa : 0.3145          
##  Mcnemar's Test P-Value : 0.3865          
##                                           
##             Sensitivity : 0.33333         
##             Specificity : 0.93846         
##          Pos Pred Value : 0.50000         
##          Neg Pred Value : 0.88406         
##              Prevalence : 0.15584         
##          Detection Rate : 0.05195         
##    Detection Prevalence : 0.10390         
##       Balanced Accuracy : 0.63590         
##                                           
##        'Positive' Class : 1               
##

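The accuracy of this model is about 84%, and sensitivity is 33%. This means that it correctly spots 33% (4 of 12) of the test-set students who got placed. Note that the accuracy is the same as the No Information Rate (0.8442), so in terms of accuracy the tree does no better than predicting ‘not placed’ for every student.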
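To make the numbers in the output above concrete, here is a minimal Python sketch that recomputes accuracy, sensitivity, specificity and the No Information Rate from the four cells of the confusion matrix. The function name `confusion_metrics` is hypothetical; the cell counts are read directly from the matrix above, with class 1 (placed) as the positive class.

```python
def confusion_metrics(tn, fn, fp, tp):
    """Basic metrics from a 2x2 confusion matrix with class 1 as 'positive'."""
    total = tn + fn + fp + tp
    positives = tp + fn   # students who actually got placed
    negatives = tn + fp   # students who did not
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / positives,
        "specificity": tn / negatives,
        # Accuracy of always predicting the larger class.
        "no_information_rate": max(positives, negatives) / total,
    }


# Row "Prediction 0": 61 (reference 0), 8 (reference 1)
# Row "Prediction 1":  4 (reference 0), 4 (reference 1)
simple_tree = confusion_metrics(tn=61, fn=8, fp=4, tp=4)
for name, value in simple_tree.items():
    print(f"{name}: {value:.4f}")
```

Sensitivity is 4 / (4 + 8), i.e. the share of the 12 placed test students that the tree flags as placed.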

Since we are interested in spotting the ones who will get jobs (so that we may give them scholarships/provide them additional job support etc.), we are looking to maximise sensitivity.

This tree, though, is probably too simple - let’s try a more complex one.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 58  6
##          1  7  6
##                                           
##                Accuracy : 0.8312          
##                  95% CI : (0.7286, 0.9069)
##     No Information Rate : 0.8442          
##     P-Value [Acc > NIR] : 0.6911          
##                                           
##                   Kappa : 0.3794          
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.50000         
##             Specificity : 0.89231         
##          Pos Pred Value : 0.46154         
##          Neg Pred Value : 0.90625         
##              Prevalence : 0.15584         
##          Detection Rate : 0.07792         
##    Detection Prevalence : 0.16883         
##       Balanced Accuracy : 0.69615         
##                                           
##        'Positive' Class : 1               
##
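The Kappa values in the two outputs above (0.3145 and 0.3794) measure agreement between predictions and actual outcomes beyond what chance alone would give. As a rough illustration, the Python sketch below recomputes Cohen's kappa from the cell counts of each matrix; the function name `cohens_kappa` is hypothetical and the counts are copied from the two confusion matrices.

```python
def cohens_kappa(tn, fn, fp, tp):
    """Cohen's kappa for a 2x2 confusion matrix."""
    total = tn + fn + fp + tp
    observed = (tn + tp) / total                 # observed agreement
    pred_pos, pred_neg = tp + fp, tn + fn        # row totals (predictions)
    ref_pos, ref_neg = tp + fn, tn + fp          # column totals (reference)
    # Agreement expected by chance if predictions and reference were independent.
    expected = (pred_pos * ref_pos + pred_neg * ref_neg) / total ** 2
    return (observed - expected) / (1 - expected)


kappa_simple = cohens_kappa(tn=61, fn=8, fp=4, tp=4)
kappa_complex = cohens_kappa(tn=58, fn=6, fp=7, tp=6)
print(f"simple tree:  {kappa_simple:.4f}")
print(f"complex tree: {kappa_complex:.4f}")
```

The more complex tree has slightly lower accuracy but higher kappa, because it agrees with the actual outcomes more often on the rarer ‘placed’ class.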

This one looks better: its sensitivity is 50% on the test (unseen) students. There is a problem, though - College Score, Field of Study Score etc. are now part of the model in a counter-intuitive way.

For example, it says that if the Fos Score is 1 (field of study score is 1 if the person is from an analytics/stats/CS background), the chances of getting a job are low.

In general, since these are derived from other variables and are giving slightly counter-intuitive results, let’s avoid using them and try some new models using only the original variables.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 57 12
##          1  8  0
##                                           
##                Accuracy : 0.7403          
##                  95% CI : (0.6277, 0.8336)
##     No Information Rate : 0.8442          
##     P-Value [Acc > NIR] : 0.9939          
##                                           
##                   Kappa : -0.1424         
##  Mcnemar's Test P-Value : 0.5023          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 0.8769          
##          Pos Pred Value : 0.0000          
##          Neg Pred Value : 0.8261          
##              Prevalence : 0.1558          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.1039          
##       Balanced Accuracy : 0.4385          
##                                           
##        'Positive' Class : 1               
##

By removing the derived variables, the results have worsened: on the test data, this tree correctly spots none of the 12 placed students (sensitivity of 0%), and its accuracy drops to about 74%.

In general, the trees shown above have two problems - 1. low sensitivity (we would want something around 50% in the best case) and 2. risk of overfitting the limited data we have.

To solve these problems, let’s try random forests, which combine many (~500) decision trees and take their majority vote (a person is predicted to be 1 or 0 only if more than 50% of the trees predict that class), hence reducing the risk of ‘overfitting’.

Apart from better prediction, we can use them to assign a numeric score to each candidate (say between 0 and 100). The ones with higher scores are likely to get placed.
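The voting idea can be sketched in a few lines of Python. This is only an illustration under an assumption: that a candidate's score is simply the share of trees voting ‘placed’ (the report does not state how the score is computed). The function names and the vote counts below are hypothetical.

```python
def vote_share(tree_votes):
    """Fraction of trees that voted 1 (placed)."""
    return sum(tree_votes) / len(tree_votes)


def majority_vote(tree_votes):
    """Predict 1 only if more than half of the trees voted 1."""
    return 1 if vote_share(tree_votes) > 0.5 else 0


# Hypothetical candidate: 130 of 500 trees vote "placed".
votes = [1] * 130 + [0] * 370

share = vote_share(votes)              # share of trees voting 1
label = majority_vote(votes)           # majority-vote prediction
score_0_to_100 = round(100 * share)    # the same share on a 0-100 scale

print(share, label, score_0_to_100)
```

Under this assumption, a candidate can lose the majority vote and still carry a useful score, which is what makes a cutoff lower than 50% possible.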

Firstly, we choose a 25% probability cutoff (each student is assigned a probability of ‘job’). The plot below tells us that p > 25% gives us a sensitivity of around 50%, while also giving decently high overall accuracy (>70%).
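The trade-off behind the choice of cutoff can be illustrated with a small sweep: for each candidate cutoff, predict 1 when the score exceeds it and measure sensitivity and accuracy. The Python sketch below uses made-up scores and outcomes (not the cohort data), and the function name `sensitivity_and_accuracy` is hypothetical.

```python
def sensitivity_and_accuracy(scores, labels, cutoff):
    """Predict 1 when score > cutoff; return (sensitivity, accuracy)."""
    preds = [1 if s > cutoff else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    positives = sum(labels)
    return tp / positives, (tp + tn) / len(labels)


# Made-up scores and outcomes for ten students, for illustration only.
scores = [0.02, 0.05, 0.10, 0.30, 0.12, 0.40, 0.60, 0.20, 0.01, 0.35]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

sweep = {}
for cutoff in (0.10, 0.25, 0.50):
    sweep[cutoff] = sensitivity_and_accuracy(scores, labels, cutoff)
    sens, acc = sweep[cutoff]
    print(f"cutoff={cutoff:.2f}  sensitivity={sens:.2f}  accuracy={acc:.2f}")
```

Lowering the cutoff catches more of the placed students at the cost of more false alarms; raising it does the opposite.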

The predictions on test data show that this model performs better than the tree built on the original variables only - we get a sensitivity of about 42% (5 of the 12 placed students) with an overall accuracy of about 71%.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 50  7
##          1 15  5
##                                        
##                Accuracy : 0.7143       
##                  95% CI : (0.6, 0.8115)
##     No Information Rate : 0.8442       
##     P-Value [Acc > NIR] : 0.9988       
##                                        
##                   Kappa : 0.1462       
##  Mcnemar's Test P-Value : 0.1356       
##                                        
##             Sensitivity : 0.41667      
##             Specificity : 0.76923      
##          Pos Pred Value : 0.25000      
##          Neg Pred Value : 0.87719      
##              Prevalence : 0.15584      
##          Detection Rate : 0.06494      
##    Detection Prevalence : 0.25974      
##       Balanced Accuracy : 0.59295      
##                                        
##        'Positive' Class : 1            
##
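Since only about 16% of the test students are placed, accuracy alone can be misleading, and balanced accuracy (the average of sensitivity and specificity) gives a fairer side-by-side view of the four models in this section. The Python sketch below recomputes it from the cell counts of the four confusion matrices above; the model labels and the function name `summarise` are made up for illustration.

```python
# (tn, fn, fp, tp) read from the four confusion matrices in this section.
models = {
    "simple tree": (61, 8, 4, 4),
    "complex tree": (58, 6, 7, 6),
    "tree, original variables": (57, 12, 8, 0),
    "random forest, cutoff 0.25": (50, 7, 15, 5),
}


def summarise(tn, fn, fp, tp):
    """Accuracy, sensitivity and balanced accuracy for one 2x2 matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tn + tp) / (tn + fn + fp + tp),
        "sensitivity": sensitivity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


summary = {name: summarise(*cells) for name, cells in models.items()}
for name, s in summary.items():
    print(f"{name:28s} acc={s['accuracy']:.4f} "
          f"sens={s['sensitivity']:.4f} bal_acc={s['balanced_accuracy']:.4f}")
```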

One can also look at the important variables in the model. The top 5 predictor variables are shown in the table below.

The two rows correspond to job = 0 and 1 respectively, and the columns contain the medians of the important variables.

We can see that the medians of age, work experience, years since passout and CTC differ markedly between the two groups; the difference in graduation score is smaller (0.71 vs 0.74).

Comparing medians of 1s and 0s
job  median(Grad.Score)  median(Current.CTC)  median(Total.Experience....yrs.)  median(AGE)  median(years_since_passout_1)
0    0.7106              950000               7                                 33           9.08
1    0.7385              655000               3                                 26           4.16

Let’s now look at the ‘scores’ provided by the model to each student. The mean score is 16%, which is expected because 16% of the students got placed.

In our case, we have chosen 25% as the ‘cutoff’ - anyone with a score above 25% is considered likely to get placed, and those with extremely low scores are extremely unlikely to get (or opt for) a job.

Thus, the scores help us quantify each candidate’s chances of getting a job. To validate that the scores make sense, we can compare the average scores of placed and unplaced students.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0100  0.0460  0.1577  0.1640  0.9080

Comparing average scores of 0s and 1s
job	round(mean(score), 2)
0	0.08
1	0.58

The average scores of placed and unplaced students are 0.58 and 0.08 respectively, thus roughly validating the scores.

To summarise, the most important predictors are Graduation score, age, years since passout, current CTC and total experience.

1.4 Model Deployment and Next Steps

To use the model, we can use the application data (combined with current CTC, since that is not filled in the application form) as the input, run some data cleaning steps and predict the score of each applicant. The ones having scores > 0.25 can be classified as ‘high placement likelihood’ students and can be given better placement assistance.
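The final classification step described above could look roughly like the Python sketch below. It assumes the scores have already been produced by the model; the applicant ids, the scores, the ‘other’ label and the function name `label_applicants` are all hypothetical, and only the 0.25 cutoff comes from this report.

```python
CUTOFF = 0.25  # cutoff chosen in this report


def label_applicants(scores_by_id, cutoff=CUTOFF):
    """Map each applicant id to a label based on the predicted score."""
    return {
        applicant_id: "high placement likelihood" if score > cutoff else "other"
        for applicant_id, score in scores_by_id.items()
    }


# Hypothetical applicant ids and scores, for illustration only.
predicted_scores = {"A101": 0.58, "A102": 0.05, "A103": 0.25}

labels_by_id = label_applicants(predicted_scores)
for applicant_id, label in labels_by_id.items():
    print(applicant_id, label)
```

A score of exactly 0.25 is not flagged here, because the rule in the text is a strict ‘> 0.25’.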

1.5 Suggestions to Improve the Model

Graduation score being an important predictor indicates that other academic variables, such as marks in 10th and 12th and rank in AIEEE etc., can be important predictors as well. We can modify our forms to include that information.