Placement Data EDA & Prediction

Introduction :

The data contains placement records for engineering courses from the years 2013 and 2014.

You can find the data here : https://www.kaggle.com/tejashvi14/engineering-placements-prediction

Importing Data :

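library(ggplot2)  # plots
library(stringr)  # str_wrap() for axis labels
library(caret)    # trainControl(), train(), varImp()

## Loading required package: lattice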

data <- read.csv("collegePlace.csv")
head(data)

##   Age Gender                        Stream Internships CGPA Hostel
## 1  22   Male Electronics And Communication           1    8      1
## 2  21 Female              Computer Science           0    7      1
## 3  22 Female        Information Technology           1    6      0
## 4  21   Male        Information Technology           0    8      0
## 5  22   Male                    Mechanical           0    8      1
## 6  22   Male Electronics And Communication           0    6      0
##   HistoryOfBacklogs PlacedOrNot
## 1                 1           1
## 2                 1           1
## 3                 0           1
## 4                 1           1
## 5                 0           1
## 6                 0           0

Data Summary :

summary(data)

##       Age           Gender             Stream           Internships    
##  Min.   :19.00   Length:2966        Length:2966        Min.   :0.0000  
##  1st Qu.:21.00   Class :character   Class :character   1st Qu.:0.0000  
##  Median :21.00   Mode  :character   Mode  :character   Median :1.0000  
##  Mean   :21.49                                         Mean   :0.7036  
##  3rd Qu.:22.00                                         3rd Qu.:1.0000  
##  Max.   :30.00                                         Max.   :3.0000  
##       CGPA           Hostel      HistoryOfBacklogs  PlacedOrNot    
##  Min.   :5.000   Min.   :0.000   Min.   :0.0000    Min.   :0.0000  
##  1st Qu.:6.000   1st Qu.:0.000   1st Qu.:0.0000    1st Qu.:0.0000  
##  Median :7.000   Median :0.000   Median :0.0000    Median :1.0000  
##  Mean   :7.074   Mean   :0.269   Mean   :0.1922    Mean   :0.5526  
##  3rd Qu.:8.000   3rd Qu.:1.000   3rd Qu.:0.0000    3rd Qu.:1.0000  
##  Max.   :9.000   Max.   :1.000   Max.   :1.0000    Max.   :1.0000

From the summary we can see that Gender, Stream, Hostel, HistoryOfBacklogs and PlacedOrNot are categorical variables, so they need to be treated as factors.

Internships and CGPA are kept numeric, as they are numeric data and do not represent classes.

Data Appreciation :

The data is already well processed and cleaned: it does not contain any missing values (summary() reports no NA counts), and the outcome classes are roughly balanced (about 55% placed, from the PlacedOrNot mean of 0.5526), so no rebalancing is needed.
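A minimal sketch of how both claims can be checked in base R. The `toy` data frame is a hypothetical stand-in built from the six rows printed by head(data); on the real data the same calls would be applied to `data`.

```r
# Hypothetical stand-in for `data`: the six rows shown by head(data).
toy <- data.frame(
  Age = c(22, 21, 22, 21, 22, 22),
  Gender = c("Male", "Female", "Female", "Male", "Male", "Male"),
  Internships = c(1, 0, 1, 0, 0, 0),
  CGPA = c(8, 7, 6, 8, 8, 6),
  Hostel = c(1, 1, 0, 0, 1, 0),
  HistoryOfBacklogs = c(1, 1, 0, 1, 0, 0),
  PlacedOrNot = c(1, 1, 1, 1, 1, 0)
)

# Number of missing values per column; all zeros means nothing is missing.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of each outcome class; values near 0.5 indicate balanced classes.
class_share <- prop.table(table(toy$PlacedOrNot))
print(class_share)
```

Six rows are far too few to say anything about balance; the point is only the shape of the check.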

I have only used as.factor to convert the categorical variables to factors, for ease of use in R.
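The conversion itself is not shown, so the following is only a sketch of one way it could be done. The `toy` data frame and the `categorical` vector are hypothetical names; the values come from the first three rows of head(data).

```r
# Hypothetical three-row stand-in for `data` (values from head(data)).
toy <- data.frame(
  Gender = c("Male", "Female", "Female"),
  Hostel = c(1, 1, 0),
  PlacedOrNot = c(1, 1, 1),
  CGPA = c(8, 7, 6)
)

# Columns to treat as categorical; on the real data this list would also
# include Stream and HistoryOfBacklogs.
categorical <- c("Gender", "Hostel", "PlacedOrNot")

# Convert the selected columns to factors in one step; CGPA stays numeric.
toy[categorical] <- lapply(toy[categorical], as.factor)
str(toy)
```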

Simple EDA & Visualisation :

Checking whether the data is balanced or not (in terms of the outcome variable):

plot_a <- ggplot(data = data, aes(PlacedOrNot))
plot_a + geom_bar(color = "black", position = "dodge")

Comparison of individual features with Target:

Age :

plot_b <- ggplot(data = data, aes(Age, fill = PlacedOrNot))
plot_b + geom_bar(color = "black", position = "dodge")

We can see that the majority of students are aged 21 or 22 in both the placed and not-placed categories. This is because most students complete their graduation at this age.

Gender :

First visualising the difference between the number of male and female students.

plot_c <- ggplot(data = data, aes(Gender))
plot_c + geom_bar()

As we can see, there is a huge difference between the number of male and female students. This feature does not seem to be very useful because of its highly unbalanced sampling.

plot_d <- ggplot(data = data, aes(Gender, fill = PlacedOrNot))
plot_d + geom_bar(color = "black", position = "dodge")

This graph also gives another insight: for this particular set of data, the success ratio of male students getting placed is higher than that of female students.

However, no firm claim can be made, as this may be an effect of the unbalanced sampling, and we do not have other data to support it.
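Because a dodged bar chart shows counts, the success ratio per gender is easier to read from a row-normalised table. A small sketch, using the Gender and PlacedOrNot values of the six head(data) rows as stand-ins; on the real data the vectors would be data$Gender and data$PlacedOrNot.

```r
# Stand-in values taken from the six rows shown by head(data).
gender <- factor(c("Male", "Female", "Female", "Male", "Male", "Male"))
placed <- factor(c(1, 1, 1, 1, 1, 0))

# Cross-tabulate counts, then normalise each row so that the shares
# within one gender sum to 1.
counts <- table(Gender = gender, PlacedOrNot = placed)
rates <- prop.table(counts, margin = 1)
print(counts)
print(rates)
```

Printing the counts next to the rates keeps the group sizes visible, which matters when one group is much smaller than the other.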

Stream :

plot_stream <- ggplot(data = data, aes(Stream, fill = PlacedOrNot))

plot_stream +
  geom_bar(color = "black", position = "dodge") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
  coord_flip()

The plot indicates that Computer Science has the highest number of placements, followed by Information Technology.

Internships :

plot_int <- ggplot(data = data, aes(Internships))

plot_int + geom_bar(color = "black", position = "dodge")

Visualizing the Placed or Not placed outcome by number of internships:

plot_int <- ggplot(data = data, aes(Internships, fill = PlacedOrNot))

plot_int + geom_bar(color = "black", position = "fill")

Here the fill-type bar graph shows that the success ratio is higher for a greater number of internships. This is not a solid result, because very few students have more than one internship.
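A fill-type chart hides group sizes, so it can help to tabulate the size of each group next to its placement rate. The sketch below uses made-up illustrative vectors (not values from the dataset); `internships`, `placed` and `summary_tbl` are hypothetical names.

```r
# Illustrative values only, NOT taken from the dataset.
internships <- c(0, 0, 0, 0, 1, 1, 1, 2)
placed      <- c(0, 0, 1, 0, 1, 0, 1, 1)

# Group size and placement rate for each number of internships.
n    <- tapply(placed, internships, length)
rate <- tapply(placed, internships, mean)

summary_tbl <- data.frame(
  Internships = names(n),
  n = as.vector(n),
  placed_rate = as.vector(rate)
)
print(summary_tbl)
```

A rate computed from a group with very small n should be read with caution, however tall the corresponding bar looks.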

CGPA :

COUNT :

plot_gpa <- ggplot(data = data, aes(CGPA))

plot_gpa + geom_bar()

Visualizing Placed or Not placed Outcome with CGPA

plot_gpa <- ggplot(data = data, aes(CGPA, fill = PlacedOrNot))

plot_gpa + geom_bar(color = "black", position = "fill")

Hostel :

plot_hostel <- ggplot(data = data, aes(Hostel, fill = PlacedOrNot))

plot_hostel + geom_bar(color = "black", position = "fill")

In my personal opinion, the Hostel variable does not have a large direct influence on the outcome; rather, it influences other features such as CGPA. As CGPA is already included in the features, I do not plan to add the Hostel feature, but I will perform some other statistical tests before ruling it out completely.
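Which tests are meant is not stated. As one possible choice (an assumption, not necessarily the author's), a chi-squared test of independence between Hostel and PlacedOrNot could look like the sketch below. The counts are illustrative only; on the real data the table would come from table(data$Hostel, data$PlacedOrNot).

```r
# Illustrative counts only, NOT taken from the dataset.
counts <- matrix(
  c(300, 200,
    280, 220),
  nrow = 2, byrow = TRUE,
  dimnames = list(Hostel = c("0", "1"), PlacedOrNot = c("0", "1"))
)

# Chi-squared test of independence; for a 2x2 table chisq.test()
# applies Yates' continuity correction by default.
test <- chisq.test(counts)
print(test$p.value)
```

A large p-value would be consistent with Hostel and the outcome being independent, though it would not by itself prove that the feature is useless.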

History of Backlogs :

plot_back <- ggplot(data = data, aes(HistoryOfBacklogs, fill = PlacedOrNot))

plot_back + geom_bar(color = "black", position = "fill")

Feature Importance :

For feature importance we need to make some changes to the data, such as converting factors to numeric and unclassing the categorical (string) variables.
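One detail worth keeping in mind with this conversion: as.numeric() on a factor returns the integer level codes, not the original labels. A short sketch using the Hostel and Gender values of the head(data) rows as stand-ins:

```r
# Hostel values from the six rows shown by head(data), stored as a factor.
hostel <- factor(c(1, 1, 0, 0, 1, 0))

# Level codes: level "0" maps to 1 and level "1" maps to 2.
codes <- as.numeric(hostel)

# Going through as.character() recovers the original 0/1 values.
values <- as.numeric(as.character(hostel))

# String factors are coded in alphabetical level order: Female = 1, Male = 2.
gender <- factor(c("Male", "Female", "Female"))
gender_codes <- as.numeric(gender)

print(codes)
print(values)
print(gender_codes)
```

For a tree-based model the shift from 0/1 to 1/2 does not change the splits, but it does matter whenever the numeric values themselves are interpreted.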

Here I have just trained a basic random forest to get the feature importance.

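# Work on a copy so the factor columns in `data` stay intact for the plots.
data3 <- data
# unclass() on a factor drops the factor class and leaves the integer level codes.
data3$Gender <- unclass(data3$Gender)
data3$Stream <- unclass(data3$Stream)
# as.numeric() on a factor also returns level codes, so 0/1 columns become 1/2.
data3$Hostel <- as.numeric(data3$Hostel)
data3$HistoryOfBacklogs <- as.numeric(data3$HistoryOfBacklogs)
data3$PlacedOrNot <- as.numeric(data3$PlacedOrNot)
data3$Gender <- as.numeric(data3$Gender)
data3$Stream <- as.numeric(data3$Stream)
# 10-fold cross-validation, repeated 3 times.
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# PlacedOrNot is numeric here, so caret fits the random forest as a regression.
model <- train(PlacedOrNot~., data=data3, method="rf", trControl=control)
# Unscaled variable importance from the fitted model.
importance <- varImp(model, scale=FALSE)
plot(importance)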

From the feature importance plot we can see that the most important features for landing a placement are CGPA, Age, Internships and Stream. The other variables (HistoryOfBacklogs, Hostel and Gender) do not explain the data as much.