Campus Recruitment

Factors influencing Employability

The dataset used in this project consists of placement data for students on a campus.

It includes secondary and higher secondary school percentages and specialisation, as well as degree specialisation and type, work experience, and the salary offered to employed students.

The dataset is provided by Jain University Bangalore and can be found here (https://www.kaggle.com/benroshan/factors-affecting-campus-placement).

The data is made up of 215 students with 15 variables.

Description of the variables:

sl_no- Serial Number
gender- Gender- Male='M',Female='F'
ssc_p- Secondary Education percentage- 10th Grade
ssc_b- Board of Education- Central/ Others
hsc_p- Higher Secondary Education percentage- 12th Grade
hsc_b- Board of Education- Central/ Others
hsc_s- Specialisation in Higher Secondary Education
degree_p- Degree Percentage
degree_t- Under Graduation(Degree type)- Field of degree education
workex- Work Experience
etest_p- Employability test percentage (conducted by college)
specialisation- Post Graduation(MBA)- Specialisation
mba_p- MBA percentage
status- Status of placement- Placed /Not Placed
salary- Annual salary offered by corporate to candidates (Value = Indian rupees)

## 'data.frame':    215 obs. of  15 variables:
##  $ sl_no         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender        : chr  "M" "M" "M" "M" ...
##  $ ssc_p         : num  67 79.3 65 56 85.8 ...
##  $ ssc_b         : chr  "Others" "Central" "Central" "Central" ...
##  $ hsc_p         : num  91 78.3 68 52 73.6 ...
##  $ hsc_b         : chr  "Others" "Others" "Central" "Central" ...
##  $ hsc_s         : chr  "Commerce" "Science" "Arts" "Science" ...
##  $ degree_p      : num  58 77.5 64 52 73.3 ...
##  $ degree_t      : chr  "Sci&Tech" "Sci&Tech" "Comm&Mgmt" "Sci&Tech" ...
##  $ workex        : chr  "No" "Yes" "No" "No" ...
##  $ etest_p       : num  55 86.5 75 66 96.8 ...
##  $ specialisation: chr  "Mkt&HR" "Mkt&Fin" "Mkt&Fin" "Mkt&HR" ...
##  $ mba_p         : num  58.8 66.3 57.8 59.4 55.5 ...
##  $ status        : chr  "Placed" "Placed" "Placed" "Not Placed" ...
##  $ salary        : int  270000 200000 250000 NA 425000 NA NA 252000 231000 NA ...

In order to make the dataset easier to use, we transformed some variables into factors and renamed all of them.

Moreover, in order to improve the graphical representations, we converted the salary to thousands of Indian rupees. See the new range:

## [1] 200 940

There are 67 missing values, all in the variable "salary": they correspond to the students with "status" = Not Employed, who received no salary offer.

No other variable has missing values, so this is also the total number of missing values in the dataset.

## [1] 67
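
The preparation steps described above can be sketched as follows. This is a minimal illustration in base R, not the code used for the report: the object name `placement` and its six rows are hypothetical stand-ins for the Kaggle file, and the relabelling of "Placed"/"Not Placed" to "Employed"/"Not Employed" is an assumption based on the labels used in the text.

```r
# Hypothetical stand-in for a few rows of the placement data
placement <- data.frame(
  gender = c("M", "F", "M", "F", "M", "M"),
  workex = c("No", "Yes", "No", "No", "No", "Yes"),
  status = c("Placed", "Placed", "Placed", "Not Placed", "Placed", "Not Placed"),
  salary = c(270000, 200000, 250000, NA, 425000, NA),
  stringsAsFactors = FALSE
)

# Relabel the target variable explicitly (assumed mapping)
placement$status <- factor(placement$status,
                           levels = c("Not Placed", "Placed"),
                           labels = c("Not Employed", "Employed"))

# Convert the remaining character columns to factors
chr_cols <- vapply(placement, is.character, logical(1))
placement[chr_cols] <- lapply(placement[chr_cols], factor)

# Express the salary in thousands of rupees
placement$salary <- placement$salary / 1000
salary_range <- range(placement$salary, na.rm = TRUE)

# Count missing values overall and per status
total_na <- sum(is.na(placement))
na_by_status <- tapply(is.na(placement$salary), placement$status, sum)

print(salary_range)
print(total_na)
print(na_by_status)
```

On the real data, the same checks would be expected to show that every missing salary belongs to a student who was not employed.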

Project topic

Create graphs that illustrate the different characteristics of the variables in the dataset.
Create plots that show the relationship between the type of institution and variables such as gender or score.
Find out the influence of the degree score on the students' final employment status.
Show the frequency of students with work experience and how it influences the final status.
Present the categories of the Employability test score.
Study the trend and features of the salary offered to employed students.
Fit a multiple linear model and predict the amount of salary.

First view of variables

This table allows the full Placement dataset to be explored.

Continuous variables

We create a multiplot in order to show the distribution of the scores at the different institutions, using density plots and histograms.

We use boxplots to check for outliers in our continuous variables.

Only the secondary and higher secondary education percentages show some outliers; we nevertheless decided to keep all the values.
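
As a rough sketch of how boxplot outliers can be listed numerically, the snippet below uses `boxplot.stats`, which flags points lying more than 1.5 times the interquartile range beyond the hinges (the same default rule that `boxplot` draws). The score vector is hypothetical and chosen so that two values fall outside the fences.

```r
# Hypothetical percentages; 37 and 97 are deliberately extreme
scores <- c(60, 62, 63, 64, 65, 66, 67, 68, 70, 97, 37)

stats <- boxplot.stats(scores)

# Five-number summary used for the box and whiskers
print(stats$stats)

# Values beyond the whiskers, i.e. the points drawn individually
outliers <- stats$out
print(outliers)

# Keeping all observations, as in the report, simply means not filtering:
kept <- scores
```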

Categorical variables

The barplots below show the distribution of students in different types of School/Degree and Specialisation.

We study the distribution by gender of the 215 students with the following pie chart.

Another important factor variable indicates whether or not the students have work experience.

The target variable of our analysis indicates whether a student was employed or not.

This bar plot summarises the frequency of employed and not employed students by gender.

Overall, the largest group of students in our data is males (yellow bar) who have found work (employed).

When a student is employed, an annual salary is offered by the corporate.

This graph shows the distribution of salary of employed candidates.

Correlation between variables

In order to decide which variables could have more influence on our target variable, we study the correlation between the school scores and the Employability score with two different graphical representations.

As both plots show, only the secondary school score is correlated with the degree score and the higher secondary school score.

The Employability score is not related to the previous school scores.
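
The numbers behind such correlation plots can be obtained with `cor`. The sketch below runs on simulated scores, not on the placement data: the column names follow the original variable names, and the dependence of `hsc_p` and `degree_p` on `ssc_p` is built in by assumption purely to produce a matrix with the same qualitative shape as the one described.

```r
set.seed(1)
n <- 100

# Simulated scores (assumption): hsc_p and degree_p depend on ssc_p,
# etest_p is generated independently
ssc_p <- rnorm(n, mean = 67, sd = 10)
scores <- data.frame(
  ssc_p    = ssc_p,
  hsc_p    = 0.5 * ssc_p + rnorm(n, mean = 33, sd = 9),
  degree_p = 0.4 * ssc_p + rnorm(n, mean = 40, sd = 6),
  etest_p  = rnorm(n, mean = 72, sd = 13)
)

# Pearson correlation matrix, rounded for display
corr <- round(cor(scores, method = "pearson"), 2)
print(corr)
```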

In order to look for patterns within pairs of institutional scores, we create 4 scatter plots with the function geom_smooth, split by final employment status.

Degree score and type

In order to discover more relations within the dataset, we start aggregating the data and combining variables.

We think that the degree score could be one of the most influential variables for the final status (Employed or Not Employed).

Let's study the distribution of the degree percentage, with a red line indicating the mean.

The mean degree score is between 65 and 70 percent.

The second plot below shows the distribution of degree score by type of university.

The most frequently chosen degree field is Commerce and Management (Comm&Mgmt).

Moreover, the third plot illustrates the average score in the different types of university: Science and Technology students get the best scores.

Comparing by final status, it is easy to see that the highest degree scores correspond to status = Employed (yellow density area).

The vertical dashed lines indicate the mean score in each status.

Specifically, the violin plot below allows us to discover more about the field of degree for each final status.

The field where employed students get better grades is Science and Technology, while the field where not employed students get better grades is Commerce and Management.

Analysing the box plot of degree score by gender, we can see that female students achieved better results in both final statuses.

Specialisation

The Post Graduation (MBA) specialisation is divided into two different fields:

Marketing and Finance
Marketing and Human resources

The first specialisation is the most frequently chosen and the one from which more students are employed.

In the case of the MBA specialisation, the average scores of employed and not employed students seem very close.

Work Experience

In working life, not only the ability to study is important.

Work experience can be a highly requested and fundamental additional skill.

The proportion of not employed students changes from 13.5% among students with work experience to 40.4% among those without.

The two pie charts below show how work experience influences the final status (employed or not):

	Work experience	No Work experience
Employed	64	84
Not Employed	10	57
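
The percentages quoted in this section follow directly from the counts in the table. As a small worked check, the sketch below rebuilds the table as a matrix (the object names are our own) and computes column-wise proportions with `prop.table`.

```r
# Counts taken from the table above
counts <- matrix(c(64, 10, 84, 57), nrow = 2,
                 dimnames = list(
                   status = c("Employed", "Not Employed"),
                   workex = c("Work experience", "No work experience")))

# Share of each status within each work-experience group (columns sum to 100)
col_pct <- round(100 * prop.table(counts, margin = 2), 1)
print(col_pct)
```

The "Not Employed" row gives 10/74 = 13.5% for students with work experience and 57/141 = 40.4% for those without.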

Employability test score

The Employability test conducted by the college summarises the score of every student in the final test for the job.

The horizontal bar plot below displays the distribution of the students by gender in different quantile categories of Employability score.

The highest range of score is reached mostly by males.
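
One way to build such quantile categories is to cut the score at its quartiles and cross-tabulate the result with gender. The sketch below is an assumption about the approach, using hypothetical scores and genders rather than the real data; the labels `Q1` to `Q4` are our own.

```r
# Hypothetical Employability scores and genders for 16 students
etest_p <- c(55, 86.5, 75, 66, 96.8, 55, 74.28, 67,
             91.34, 54, 60, 68, 76, 87, 62, 93)
gender <- factor(c("M", "M", "F", "M", "M", "F", "F", "M",
                   "M", "F", "M", "F", "M", "M", "F", "M"))

# Quartile break points of the score
breaks <- quantile(etest_p, probs = seq(0, 1, by = 0.25))

# Assign each student to a quartile category (lowest value included)
etest_cat <- cut(etest_p, breaks = breaks, include.lowest = TRUE,
                 labels = c("Q1", "Q2", "Q3", "Q4"))

# Students per category and gender: the numbers behind a grouped bar plot
by_gender <- table(etest_cat, gender)
print(by_gender)
```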

We drew a random sample of 15 students in order to visualise how their scores changed between the degree score and the Employability score.

The majority of students lost positions, and many of them will not be employed.

Salary

The frequency of final status of placement students is presented below:

Not Employed	Employed
67	148
31.2%	68.8%

To sum up, the salary indicates the amount of money offered by the corporate to employed candidates.

Consequently, the continuous variable salary contains 67 missing values, which will not be displayed.

The first set of scatter plots shows how the salary is influenced by the scores at the different types of school, distinguishing between male and female students.

We divided the salary into several categories in order to present the percentage distribution of students by gender in the following pyramid chart.

There is not much difference between genders: all students can reach a higher salary,

even if the percentage of males in the first salary bracket is slightly smaller.

The deviation graph below shows the influence of the institutional scores on the amount of the salary.

None of the variables is negatively correlated with it.

Continuing to consider only employed students, the table below shows that only a few students with specialisation = Marketing & Finance managed to reach a very high salary.

Multiple Linear Regression

Multiple Linear Regression (MLR) is a statistical technique for finding an association between a dependent variable and several independent variables.

The functional form is given by:

\[Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \dots + \beta_{p} X_{p} + e\] where:

Y is the dependent variable,
X1, X2, ..., Xp are the independent variables,
β0 is a constant (the intercept),
β1, β2, ..., βp are the partial regression coefficients (their estimates are written b0, b1, ..., bp below),
e is the error term (residual).

Let's start by fitting the model with Y = Salary.

From the output, the estimated values of the parameters are:

b0 = 168690.94; b1 = -641.63; b2 = 59.53 ; b3 = -2061.50; b4 = 1112.66 and b5 = 3548.13

The regression model is given by:

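\[\text{Salary}_p = b_0 + b_1\,\text{Secondary\_School\_Score} + b_2\,\text{High\_School\_Score} + b_3\,\text{Degree\_Score} + b_4\,\text{Employability\_Score} + b_5\,\text{Specialisation\_Score}\]

\[\text{Salary}_p = 168690.94 - 641.63\,\text{Secondary\_School\_Score} + 59.53\,\text{High\_School\_Score} - 2061.50\,\text{Degree\_Score} + 1112.66\,\text{Employability\_Score} + 3548.13\,\text{Specialisation\_Score}\]

The Salary would grow by 1112.66 rupees for every one-point increase in Employability_Score, provided all other variables are held constant.

With a significance level of alpha = 0.1, the p-values of Employability_Score (0.0658) and Specialisation_Score (0.0275) are smaller than alpha, so we reject the null hypotheses that their coefficients are equal to 0: both variables are significant. The other three scores are not significant at this level, and the model as a whole explains only about 7% of the variance in salary (Multiple R-squared = 0.069).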

## 
## Call:
## lm(formula = Salary ~ ., data = Placement.lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -104005  -50454   -9766   14722  619162 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)  
## (Intercept)            168690.94   98916.08   1.705   0.0903 .
## Secondary_School_Score   -641.63    1015.97  -0.632   0.5287  
## High_School_Score          59.53     887.49   0.067   0.9466  
## Degree_Score            -2061.50    1368.74  -1.506   0.1343  
## Employability_Score      1112.66     600.13   1.854   0.0658 .
## Specialisation_Score     3548.13    1593.41   2.227   0.0275 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91730 on 142 degrees of freedom
## Multiple R-squared:  0.06936,    Adjusted R-squared:  0.03659 
## F-statistic: 2.117 on 5 and 142 DF,  p-value: 0.06682
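
As a worked check on how the figures in this summary relate to each other, the sketch below recomputes the t values, p-values, adjusted R-squared and F-statistic from the printed estimates alone. The number of observations, n = 148, is inferred from the 142 residual degrees of freedom plus 6 estimated parameters; the variable names in the code are our own.

```r
# Estimates and standard errors copied from the summary above
est <- c(Employability_Score = 1112.66, Specialisation_Score = 3548.13)
se  <- c(Employability_Score = 600.13,  Specialisation_Score = 1593.41)
df_resid <- 142

# t value = estimate / standard error; two-sided p-value from the t distribution
t_val <- est / se
p_val <- 2 * pt(abs(t_val), df = df_resid, lower.tail = FALSE)
print(round(t_val, 3))
print(round(p_val, 4))

# Adjusted R-squared and overall F-statistic from Multiple R-squared
r2 <- 0.06936
n  <- 148   # inferred: residual df (142) + 6 parameters
p  <- 5     # number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
f_pval <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
print(c(adj_r2 = adj_r2, f_stat = f_stat, f_pval = f_pval))
```

Both coefficient p-values fall below alpha = 0.1, which is why the corresponding null hypotheses are rejected, while the overall F-test is only significant at that same 10% level.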

In each of the four diagnostic plots, the three most extreme data points are labelled with their row numbers in the dataset.

The standardized residual is the residual divided by its standard deviation.

Residuals vs Fitted: Used to check the linear relationship assumption. An approximately horizontal line (red line), without distinct patterns is an indication for a linear relationship. Any pattern in the residual plot would indicate incorrect specification of the model. In our example, there is no pattern in the residual plot. This suggests that we can assume linear relationship.

Normal Q-Q: Used to examine whether the residuals are normally distributed. It is good if the residual points follow the straight dashed line. In our example, most of the points fall approximately along the reference line, so we can assume normality.

Scale-Location: Used to check the homogeneity of variance of the residuals (homoscedasticity). This plot shows whether the residuals are spread equally along the ranges of the predictors. Failure to meet this assumption makes the hypothesis tests unreliable. It is good if you see a horizontal line with equally spread points. In our case we have an approximate indication of homoscedasticity.

Residuals vs Leverage: Used to identify influential cases, that is, extreme values that might influence the regression results when included or excluded from the analysis.
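
The quantities behind these four plots can also be extracted directly from a fitted model. The sketch below is illustrative only: it fits `lm` to simulated data (the variables `x1`, `x2` and `y` are hypothetical, not the placement data) and uses the base R helpers `rstandard`, `hatvalues` and `cooks.distance`.

```r
set.seed(42)
n  <- 60
x1 <- runif(n, 50, 90)
x2 <- runif(n, 50, 90)
y  <- 100 + 2 * x1 + 1.5 * x2 + rnorm(n, sd = 20)  # simulated response

fit <- lm(y ~ x1 + x2)

# Standardized residuals: residual divided by its estimated standard deviation
std_res <- rstandard(fit)

# Leverage (hat values) and Cook's distance, used in Residuals vs Leverage
lev  <- hatvalues(fit)
cook <- cooks.distance(fit)

# Row names of the three most extreme standardized residuals
top3 <- names(sort(abs(std_res), decreasing = TRUE))[1:3]
print(top3)

# The four diagnostic plots themselves (drawn only in an interactive session)
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  plot(fit)
  par(op)
}
```

A useful sanity check is that the hat values sum to the number of estimated coefficients (here 3).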

Prediction linear model

In this section we fit a model in order to predict the amount of salary.

First of all, we filtered the data, keeping only the employed students, who have a salary.

In order to split the data into training and test sets we used the function createDataPartition.

We assign:

70% to the training set (105 observations)
30% to the test set (43 observations)
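
For readers without the caret package, a plain random 70/30 split can be sketched in base R as below. This swaps in `sample` for createDataPartition, so it is not the report's method: as far as we know, createDataPartition samples within groups of the outcome, which is why its training set (105 rows) differs slightly in size from the simple split shown here. The data frame `employed` and its simulated salaries are hypothetical.

```r
set.seed(123)
n <- 148  # number of employed students

# Hypothetical stand-in for the filtered data (salary in thousands of rupees)
employed <- data.frame(id = seq_len(n),
                       Salary = round(runif(n, 200, 940)))

# Draw 70% of the row indices at random for training
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- employed[train_idx, ]
test  <- employed[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```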

Now we are able to fit our model and discover the importance of the variables.

Here, we predict the salary and plot it.

The most important variable in salary prediction is the final Employability score.

Next, we predict the salary of two virtual students created at random.

The predicted salaries show that a high Employability score (Luca) is more influential than good school scores (Stefan).

Stefan

Secondary school percentage = 80
Higher school percentage = 82
Degree school percentage = 86
Employability test percentage = 70
Specialisation percentage = 78

Luca

Secondary school percentage = 70
Higher school percentage = 62
Degree school percentage = 72
Employability test percentage = 88
Specialisation percentage = 76

## [1] "Salary for Stefan: 304036.45"

## [1] "Salary for Luca: 349414.6"
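
To make the mechanics of such a prediction explicit, the sketch below applies the coefficients of the full-data regression reported in the Multiple Linear Regression section to the two profiles. This is an illustration under an assumption: the salaries printed above presumably come from the model refitted on the training set only, so the values computed here are close to them but not identical.

```r
# Coefficients from the full-data regression summary
coefs <- c("(Intercept)"          = 168690.94,
           Secondary_School_Score = -641.63,
           High_School_Score      = 59.53,
           Degree_Score           = -2061.50,
           Employability_Score    = 1112.66,
           Specialisation_Score   = 3548.13)

# The two virtual students, one row each, columns in coefficient order
students <- rbind(Stefan = c(80, 82, 86, 70, 78),
                  Luca   = c(70, 62, 72, 88, 76))
colnames(students) <- names(coefs)[-1]

# Prediction = intercept + sum of coefficient * score
pred <- drop(cbind(1, students) %*% coefs)
print(round(pred, 2))
```

With these coefficients the linear predictor gives about 299593 for Stefan and about 346612 for Luca, so the ordering of the two students is the same as in the report.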

Highest-paid students

We are curious to see which students were offered salaries higher than 600 (thousand rupees), comparing their Employability scores and whether or not they have work experience.

Let's include the amount of salary for each one.

Furthermore, we decided to investigate the trend of the scores at the different schools for the three highest-paid students.

It is easy to see that a high salary is not always linked with high scores: see the student with ID 151.

Conclusion

Regarding the amount of salary, we can say that an excellent job test score and work experience should be the best way to reach higher values, though not without exceptions.

Then, we decided to conclude our project by visualising this particular graph, which shows the flow of all students across different variables such as degree score, gender and work experience.

Naturally, the placement status (employed or not) is the last division in the following graph.

We included the different types of degree because we considered the degree score the most influential on the final job status.

Finally, we decided to analyse in detail the distribution of employed students by quantile categories of the graduation and specialisation scores.

As we might expect, the highest percentage of employed students obtained a high score in both.

Moreover, it is curious to see that there is also a high percentage of employed candidates who scored roughly between 60 and 70.

This also confirms the idea that not only very studious students can find work.

Campus Recruitment: Advanced Visualisation in R

Matteo Pancaldi, Riccardo Ventura