The database used in this project consists of placement data for students on a campus.
It includes secondary and higher secondary school percentages and specialisation. It also includes degree specialisation, degree type, work experience and the salary offered to the employed students.
The database is provided by Jain University Bangalore and it can be found here (https://www.kaggle.com/benroshan/factors-affecting-campus-placement).
The data is made up of 215 students with 15 variables.
Description of the variables:
- sl_no: Serial number
- gender: Gender (Male = 'M', Female = 'F')
- ssc_p: Secondary Education percentage (10th Grade)
- ssc_b: Board of Education (Central/Others)
- hsc_p: Higher Secondary Education percentage (12th Grade)
- hsc_b: Board of Education (Central/Others)
- hsc_s: Specialisation in Higher Secondary Education
- degree_p: Degree percentage
- degree_t: Under Graduation (degree type), field of degree education
- workex: Work experience
- etest_p: Employability test percentage (conducted by college)
- specialisation: Post Graduation (MBA) specialisation
- mba_p: MBA percentage
- status: Status of placement (Placed/Not Placed)
- salary: Annual salary offered by corporate to candidates (in Indian Rupees)

## 'data.frame': 215 obs. of 15 variables:
## $ sl_no : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "M" "M" "M" "M" ...
## $ ssc_p : num 67 79.3 65 56 85.8 ...
## $ ssc_b : chr "Others" "Central" "Central" "Central" ...
## $ hsc_p : num 91 78.3 68 52 73.6 ...
## $ hsc_b : chr "Others" "Others" "Central" "Central" ...
## $ hsc_s : chr "Commerce" "Science" "Arts" "Science" ...
## $ degree_p : num 58 77.5 64 52 73.3 ...
## $ degree_t : chr "Sci&Tech" "Sci&Tech" "Comm&Mgmt" "Sci&Tech" ...
## $ workex : chr "No" "Yes" "No" "No" ...
## $ etest_p : num 55 86.5 75 66 96.8 ...
## $ specialisation: chr "Mkt&HR" "Mkt&Fin" "Mkt&Fin" "Mkt&HR" ...
## $ mba_p : num 58.8 66.3 57.8 59.4 55.5 ...
## $ status : chr "Placed" "Placed" "Placed" "Not Placed" ...
## $ salary : int 270000 200000 250000 NA 425000 NA NA 252000 231000 NA ...
To make the database easier to use, we transformed some variables into factors and renamed all of them.
Moreover, to improve the graphical representations, we converted the salary into thousands of Indian Rupees. See the new range:
## [1] 200 940
There are 67 missing values, which correspond to the students whose "status" is Not Employed.
These students have NA in the variable "salary", and they account for all the missing values in the dataset.
## [1] 67
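The preprocessing steps described above can be sketched in base R as follows. This is only an illustration: the small data frame `placement_demo` and its values are hypothetical stand-ins for the real file, and only the column names `gender`, `status` and `salary` come from the dataset.

```r
# Hypothetical miniature of the placement data (values are illustrative only)
placement_demo <- data.frame(
  gender = c("M", "F", "M", "F"),
  status = c("Placed", "Not Placed", "Placed", "Placed"),
  salary = c(270000L, NA, 425000L, 200000L),
  stringsAsFactors = FALSE
)

# Convert the character columns into factors
placement_demo$gender <- factor(placement_demo$gender)
placement_demo$status <- factor(placement_demo$status)

# Express the salary in thousands of rupees
placement_demo$salary <- placement_demo$salary / 1000

# Range of the observed salaries and total number of missing values
salary_range <- range(placement_demo$salary, na.rm = TRUE)
n_missing <- sum(is.na(placement_demo))
print(salary_range)
print(n_missing)
```

In this toy example the only missing salary belongs to the student who was not placed, which mirrors the pattern reported for the full data.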
Visualise graphs that illustrate the characteristics of the variables in the dataset.
Create plots that show the relationship between the type of establishment and variables like gender or score.
Find out the influence of the degree score on the students' final employment status.
Show the frequency of students with work experience and how it influences the final status.
Present the categories of the Employability test score.
Study the trend and features of the salary offered to the employed students.
Fit a multiple linear model and predict the amount of salary.
This table allows the reader to explore the full Placement dataset.
We create a multiplot to show the distribution of scores in the different institutions with density curves and histograms.
We use boxplots to check for outliers in our continuous variables.
Only the Higher Secondary and Secondary Education percentages present some outliers, but we decided to keep all the values.
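As a rough illustration of how a boxplot flags outliers, the sketch below uses `boxplot.stats()`, which applies the same whisker rule that `boxplot()` draws (points more than 1.5 times the box length beyond the hinges). The vector `scores` is hypothetical and is not taken from the dataset.

```r
# Hypothetical percentage scores; the value 20 is planted as an outlier
scores <- c(60, 62, 64, 65, 66, 67, 68, 70, 72, 20)

# boxplot.stats() returns the five-number summary and the points beyond the whiskers
stats <- boxplot.stats(scores)
outliers <- stats$out
print(outliers)

# Keep every value, as in the report, but flag the unusual ones
is_outlier <- scores %in% outliers
print(sum(is_outlier))
```

Flagging instead of deleting keeps the full sample available for the later plots and models.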
The barplots below show the distribution of students in different types of School/Degree and Specialisation.
We study the distribution by gender of the 215 students with the following pie chart.
Another important factor variable tells us whether the students have work experience or not.
The target variable of our analysis indicates whether a student was employed or not.
This bar plot summarises the frequency of employed and not employed students by gender.
Overall, the largest group of students in our data is males (yellow bar) who have found work (employed).
When a student is employed, an annual salary is offered by the corporate.
This graph shows the distribution of salary of employed candidates.
To decide which variables could have more influence on our target variable, we study the correlation between the school scores and the Employability score with two different graphical representations.
As both plots show, only the Secondary school score is correlated with the Degree score and the Higher Secondary school score.
The Employability score is not related to the previous school scores.
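The kind of correlation matrix behind such plots can be sketched with base R's `cor()`. The data below are simulated purely for illustration (the variable names `secondary`, `degree` and `employability` and all numeric settings are assumptions): the degree score is built to depend on the secondary score, while the employability score is generated independently.

```r
set.seed(1)  # reproducible hypothetical data
n <- 100
secondary <- rnorm(n, mean = 67, sd = 10)
# Degree score depends on the secondary score; the test score is independent
degree <- 0.6 * secondary + rnorm(n, mean = 26, sd = 5)
employability <- rnorm(n, mean = 72, sd = 13)

scores <- data.frame(secondary, degree, employability)

# Pearson correlation (the default method of cor())
cor_matrix <- cor(scores)
print(round(cor_matrix, 2))
```

A large entry for a pair of school scores next to a near-zero entry for the employability score is the pattern described in the text.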
To look for patterns within pairs of institutional scores, we create 4 scatter plots with the function geom_smooth, split by final employment status.
In order to discover more relations inside the dataset, we start to aggregate data and combine variables.
We think that the degree score could be one of the most influential variables for the final status, Employed or Not Employed.
Let's study the distribution of the degree percentage, with a red line that indicates the mean.
The mean degree score is between 65 and 70 percent.
The second plot below shows the distribution of degree score by type of university.
The most frequently chosen field of degree is Communications Management.
Moreover, the third plot illustrates the average score in the different types of university: Science and Technology students get the best scores.
Compared across final status, it is easy to see that the highest degree scores correspond to status = Employed (yellow density area).
The vertical dashed lines indicate the mean score for each status.
Specifically, the violin plot below gives more information about the field of degree for each final status.
The field where employed students get better grades is Science and Technology, while the field where not employed students get better grades is Communications Management.
The box plot of degree score by gender shows that female students achieved better results in both final statuses.
The Post Graduation (MBA) specialisation is divided into two different fields:
The first type of specialisation is the most frequently chosen and the one with the most employed students.
In the case of MBA specialisation, the average scores of employed and not employed students seem very close.
In working life, the ability to study is not the only thing that matters.
Work experience could be a highly requested and fundamental additional skill.
The proportion of not employed students rises from 13.5% among students with work experience to 40.4% among those without.
Here are two pie plots that show how work experience influences the final status, employed or not:
| | Work experience | No work experience |
|---|---|---|
| Employed | 64 | 84 |
| Not Employed | 10 | 57 |
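The two percentages quoted in the text follow directly from the counts in this table. A short base R check, using only those four counts (the object names are illustrative):

```r
# Counts taken from the table above (rows: status, columns: work experience)
counts <- matrix(
  c(64, 10, 84, 57),
  nrow = 2,
  dimnames = list(
    status = c("Employed", "Not Employed"),
    workex = c("Work experience", "No work experience")
  )
)

# Column-wise proportions: share of each status within each experience group
col_share <- prop.table(counts, margin = 2)
not_employed_pct <- round(100 * col_share["Not Employed", ], 1)
print(not_employed_pct)
```

With work experience, 10 of 74 students are not employed (13.5%); without it, 57 of 141 are not employed (40.4%).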
The Employability test conducted by the college summarises the score of every student in the final test for the job.
The horizontal bar plot below displays the distribution of the students by gender across quantile categories of the Employability score.
The highest score range is reached mostly by males.
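One way to build such quantile categories is to combine `quantile()` with `cut()`. The sketch below assumes quartiles (the report does not state how many categories were used) and works on simulated scores; `etest` and `gender` are hypothetical vectors, not the real columns.

```r
set.seed(42)  # reproducible hypothetical scores
etest <- round(runif(40, min = 50, max = 98), 1)
gender <- sample(c("M", "F"), size = 40, replace = TRUE)

# Quartile break points, then one category per score
breaks <- quantile(etest, probs = seq(0, 1, by = 0.25))
etest_cat <- cut(etest, breaks = breaks, include.lowest = TRUE,
                 labels = c("Q1", "Q2", "Q3", "Q4"))

# Students per quartile category and gender
print(table(etest_cat, gender))
```

The argument `include.lowest = TRUE` matters here: without it the student with the minimum score would fall outside the first interval and become NA.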
We drew a random sample of 15 students to visualise how their score changed between the degree score and the Employability score.
The majority of students lost positions, and many of them were not employed.
The frequency of the final placement status of the students is presented below:
| Not Employed | Employed |
|---|---|
| 67 | 148 |
| 31.2% | 68.8% |
To recap, the salary indicates the amount of money offered by the corporate to employed candidates.
Consequently, the continuous variable salary contains 67 missing values, which are not displayed.
The first set of scatter plots shows how the salary is influenced by the scores in the different types of school, distinguishing between male and female students.
We divided the salary into several categories to present the percentage distribution of students by gender in the following pyramid chart.
There is not much difference between genders; all students can reach a higher salary.
In the first salary bracket, however, the percentage of males is slightly smaller.
The deviation graph below shows the effect of the institutional scores on the amount of the salary.
No variables are negatively correlated with it.
Still considering only employed students, the table below shows that just a few students with degree type = Marketing & Finance managed to reach a very high salary.
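Multiple Linear Regression (MLR) is a statistical technique for finding an association between a dependent variable and several independent variables.
The functional form is given by:
\[Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \dots + \beta_{p} X_{p} + e\] where \(Y\) is the dependent variable, \(X_{1}, \dots, X_{p}\) are the independent variables, \(\beta_{0}, \dots, \beta_{p}\) are the regression coefficients and \(e\) is the error term.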
Let's start to fit the model with Y = Salary.
From the output, the estimated values of the parameters are:
b0 = 168690.94; b1 = -641.63; b2 = 59.53 ; b3 = -2061.50; b4 = 1112.66 and b5 = 3548.13
The regression model is given by:
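\[\mathrm{Salary}_p = b_0 + b_1\,\mathrm{Secondary\_School\_Score} + b_2\,\mathrm{High\_School\_Score} + b_3\,\mathrm{Degree\_Score} + b_4\,\mathrm{Employability\_Score} + b_5\,\mathrm{Specialisation\_Score}\]
\[\mathrm{Salary}_p = 168690.94 - 641.63\,\mathrm{Secondary\_School\_Score} + 59.53\,\mathrm{High\_School\_Score} - 2061.50\,\mathrm{Degree\_Score} + 1112.66\,\mathrm{Employability\_Score} + 3548.13\,\mathrm{Specialisation\_Score}\]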
The salary would grow by 1112.66 for every one-point increase in Employability_Score, provided all other variables are held constant.
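This interpretation of a coefficient can be checked numerically. The sketch below copies the estimated coefficients reported in this section into a named vector and evaluates the regression equation by hand; the helper `predict_salary()` and the student with a score of 70 in every variable are hypothetical and serve only as an example.

```r
# Coefficients copied from the fitted model reported in this section
coefs <- c(
  intercept              = 168690.94,
  Secondary_School_Score = -641.63,
  High_School_Score      = 59.53,
  Degree_Score           = -2061.50,
  Employability_Score    = 1112.66,
  Specialisation_Score   = 3548.13
)

# Hypothetical helper: intercept + sum of (coefficient * score)
predict_salary <- function(scores) {
  unname(coefs["intercept"] + sum(coefs[names(scores)] * scores))
}

# Hypothetical student with a score of 70 in every variable
student <- c(Secondary_School_Score = 70, High_School_Score = 70,
             Degree_Score = 70, Employability_Score = 70,
             Specialisation_Score = 70)
base <- predict_salary(student)

# Same student with one extra point in the employability test
student_plus <- student
student_plus["Employability_Score"] <- 71
gain <- predict_salary(student_plus) - base
print(base)
print(gain)
```

The difference between the two predictions equals the Employability_Score coefficient, because every other score is unchanged.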
The p-values of Employability_Score (0.0658) and Specialisation_Score (0.0275) are smaller than the significance level alpha = 0.1, so we reject the null hypotheses that their coefficients are equal to 0: both variables are statistically significant at that level.
##
## Call:
## lm(formula = Salary ~ ., data = Placement.lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104005 -50454 -9766 14722 619162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168690.94 98916.08 1.705 0.0903 .
## Secondary_School_Score -641.63 1015.97 -0.632 0.5287
## High_School_Score 59.53 887.49 0.067 0.9466
## Degree_Score -2061.50 1368.74 -1.506 0.1343
## Employability_Score 1112.66 600.13 1.854 0.0658 .
## Specialisation_Score 3548.13 1593.41 2.227 0.0275 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 91730 on 142 degrees of freedom
## Multiple R-squared: 0.06936, Adjusted R-squared: 0.03659
## F-statistic: 2.117 on 5 and 142 DF, p-value: 0.06682
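The t values and p-values in the summary above can be reproduced from the estimates and standard errors alone. A short base R check for the two significant coefficients (the numbers are copied from the output; the object names are illustrative):

```r
# Estimates and standard errors copied from the summary output above
estimate  <- c(Employability_Score = 1112.66, Specialisation_Score = 3548.13)
std_error <- c(Employability_Score = 600.13,  Specialisation_Score = 1593.41)
df_resid  <- 142   # residual degrees of freedom
alpha     <- 0.1   # significance level used in the text

# t value = estimate / standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(round(t_value, 3))
print(round(p_value, 4))
print(p_value < alpha)  # TRUE means the null hypothesis (coefficient = 0) is rejected
```

Both p-values fall below 0.1, but only Specialisation_Score would also pass the stricter 0.05 threshold.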
The four plots show the top 3 most extreme data points labelled with the row numbers of the data in the data set.
The standardized residual is the residual divided by its standard deviation.
Residuals vs Fitted: Used to check the linear relationship assumption. An approximately horizontal line (red line) without distinct patterns is an indication of a linear relationship. Any pattern in the residual plot would indicate incorrect specification of the model. In our example, there is no pattern in the residual plot. This suggests that we can assume a linear relationship.
Normal Q-Q: Used to examine whether the residuals are normally distributed. It is good if the residual points follow the straight dashed line. In our example, most of the points fall approximately along the reference line, so we can assume normality.
Scale-Location: Used to check the homogeneity of variance of the residuals (homoscedasticity). This plot shows if residuals are spread equally along the ranges of the predictors. Failure to meet this assumption will result in unreliability of the hypothesis tests. It is good if you see a horizontal line with equally spread points. In our case we have an approximate indication of homoscedasticity.
Residuals vs Leverage: Used to identify influential cases, that is, extreme values that might influence the regression results when included or excluded from the analysis.
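The standardized residuals used in these plots can be computed by hand. The sketch below uses the built-in `mtcars` data purely as a stand-in for the placement data (the model `mpg ~ wt + hp` is an arbitrary example), divides each residual by its estimated standard deviation, and compares the result with `rstandard()`.

```r
# Built-in mtcars data used only as a stand-in for the placement data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Manual computation: residual / (sigma * sqrt(1 - leverage))
res   <- residuals(fit)
sigma <- summary(fit)$sigma
lev   <- hatvalues(fit)
std_res_manual <- res / (sigma * sqrt(1 - lev))

# rstandard() returns the same quantity
print(all.equal(std_res_manual, rstandard(fit)))

# The four diagnostic plots discussed above, drawn in a 2 x 2 grid
if (!interactive()) pdf(NULL)  # avoid writing a file when run as a script
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)
```

The leverage term explains why points far from the centre of the predictors can have a large standardized residual even when the raw residual looks modest.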
In this section we decided to fit a model in order to predict the amount of salary.
First of all, we filtered the data considering only the students employed with salary.
In order to split data into train and test we used the function createDataPartition.
We are going to assign:
Now we are able to fit our model and discover the importance of the variables.
Here, we predict the salary and plot the result.
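The split, fit and predict workflow can be sketched in base R. Several things here are assumptions: the data are simulated, the 80% training share is chosen only for illustration, a plain linear model stands in for whichever model was fitted in the report, and `sample()` replaces `createDataPartition` (which additionally balances the outcome distribution across the two sets).

```r
set.seed(123)  # reproducible split

# Hypothetical employed students: salary (in thousands) driven partly by one score
n <- 148
demo <- data.frame(Employability_Score = runif(n, 50, 98))
demo$Salary <- 200 + 1.1 * demo$Employability_Score + rnorm(n, sd = 30)

# Simple random split: 80% of the rows for training (assumed share)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- demo[train_idx, ]
test  <- demo[-train_idx, ]

# Fit on the training rows, predict the held-out rows
fit  <- lm(Salary ~ Employability_Score, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the test set
rmse <- sqrt(mean((test$Salary - pred)^2))
print(rmse)
```

Evaluating on rows the model has not seen gives a more honest picture of prediction error than the residuals of the training fit.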
The most important variable in salary prediction is the final Employability score.
Next, we predict the salary of two virtual students created at random.
The predicted salaries show that a good grade in the Employability score (Luca) is more influential than good scores in school (Stefan).
Stefan
Luca
## [1] "Salary for Stefan: 304036.45"
## [1] "Salary for Luca: 349414.6"
We are curious to see the salaries higher than 600 thousand rupees and the students who reached them, comparing their Employability scores and whether they have work experience or not.
Let's include the amount of salary for each one.
Furthermore, we investigate the trend of the scores in the different schools for three more salaried students.
It is easy to see that a high salary is not always linked to a high score: see student ID 151.
Regarding the amount of salary, we could say that an excellent job test and work experience should be the best way to reach higher values, though not without exceptions.
We conclude our project by visualising this particular graph, which shows the flow of all students across different variables such as degree score, gender and work experience.
Naturally, the placement status represents the last division in the following graph.
We included the different types of degree because we considered the degree score the most influential score for the final job status.
Finally, we analyse in detail the distribution of employed students by quantile categories of graduation and specialisation scores.
As we might expect, the higher percentage of employed students obtained a higher score in both.
Moreover, it is curious to see that there is a higher percentage of employed candidates with a score roughly between 60 and 70.
This also confirms the idea that not only very studious students can find work.