The data contains placement records for engineering courses from the years 2013 and 2014.
You can find the data here: https://www.kaggle.com/tejashvi14/engineering-placements-prediction
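library(ggplot2)   # plotting
library(stringr)   # str_wrap() for axis labels
library(caret)     # train(), trainControl(), varImp()
## Loading required package: lattice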
data <- read.csv("collegePlace.csv")
head(data)
##   Age Gender                        Stream Internships CGPA Hostel
## 1  22   Male Electronics And Communication           1    8      1
## 2  21 Female              Computer Science           0    7      1
## 3  22 Female        Information Technology           1    6      0
## 4  21   Male        Information Technology           0    8      0
## 5  22   Male                    Mechanical           0    8      1
## 6  22   Male Electronics And Communication           0    6      0
##   HistoryOfBacklogs PlacedOrNot
## 1                 1           1
## 2                 1           1
## 3                 0           1
## 4                 1           1
## 5                 0           1
## 6                 0           0
Data summary:
summary(data)
##       Age           Gender             Stream           Internships
##  Min.   :19.00   Length:2966        Length:2966        Min.   :0.0000
##  1st Qu.:21.00   Class :character   Class :character   1st Qu.:0.0000
##  Median :21.00   Mode  :character   Mode  :character   Median :1.0000
##  Mean   :21.49                                         Mean   :0.7036
##  3rd Qu.:22.00                                         3rd Qu.:1.0000
##  Max.   :30.00                                         Max.   :3.0000
##       CGPA           Hostel      HistoryOfBacklogs  PlacedOrNot
##  Min.   :5.000   Min.   :0.000   Min.   :0.0000    Min.   :0.0000
##  1st Qu.:6.000   1st Qu.:0.000   1st Qu.:0.0000    1st Qu.:0.0000
##  Median :7.000   Median :0.000   Median :0.0000    Median :1.0000
##  Mean   :7.074   Mean   :0.269   Mean   :0.1922    Mean   :0.5526
##  3rd Qu.:8.000   3rd Qu.:1.000   3rd Qu.:0.0000    3rd Qu.:1.0000
##  Max.   :9.000   Max.   :1.000   Max.   :1.0000    Max.   :1.0000
From the data we can see that Gender, Stream, Hostel, HistoryOfBacklogs and PlacedOrNot are all categorical, so they need to be treated as factor variables.
Internships and CGPA are not treated as factors because they are numeric and do not represent classes.
The data is already well processed and cleaned: it contains no missing values, and the classes are roughly balanced (about 55% of students are placed), so no class rebalancing is needed.
I have used as.factor to convert the categorical variables to factors for ease of use in R.
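The conversion step itself is not shown here. A minimal sketch of what it could look like follows, using a small made-up data frame `toy` (hypothetical values, same column names as the real data) so that it runs without the CSV. The name `factor_cols` is also just an illustration.

```r
# Toy stand-in for collegePlace.csv: values are made up, column names match the real data.
toy <- data.frame(
  Age = c(22, 21, 22, 21),
  Gender = c("Male", "Female", "Female", "Male"),
  Stream = c("Mechanical", "Computer Science", "Civil", "Computer Science"),
  Internships = c(1, 0, 1, 0),
  CGPA = c(8, 7, 6, 8),
  Hostel = c(1, 1, 0, 0),
  HistoryOfBacklogs = c(1, 1, 0, 1),
  PlacedOrNot = c(1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Columns treated as categorical in this analysis.
factor_cols <- c("Gender", "Stream", "Hostel", "HistoryOfBacklogs", "PlacedOrNot")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

# Quick checks: column types, missing values, and class balance of the outcome.
str(toy)
n_missing <- sum(is.na(toy))
class_balance <- prop.table(table(toy$PlacedOrNot))
print(n_missing)
print(class_balance)
```

On the real data, the same `lapply` call would be applied to `data` in place of `toy`.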
plot_a <- ggplot(data = data, aes(PlacedOrNot))
plot_a + geom_bar(color = "black", position = "dodge")
plot_b <- ggplot(data = data, aes(Age, fill = PlacedOrNot))
plot_b + geom_bar(color = "black", position = "dodge")
We can see that the majority of students are aged 21 or 22 in both the placed and not-placed categories, most likely because most students graduate at this age.
First, we visualise the difference between the number of male and female students.
plot_c <- ggplot(data = data, aes(Gender))
plot_c + geom_bar()
As we can see, there is a large difference between the number of male and female students. Because the sample is so unbalanced across genders, this feature does not seem to be very useful.
plot_d <- ggplot(data = data, aes(Gender, fill = PlacedOrNot))
plot_d + geom_bar(color = "black", position = "dodge")
However, this graph offers another insight: in this particular data set, the proportion of male students who get placed is higher than that of female students.
No general claim can be made, though, as this may be an effect of the unbalanced sampling and we have no other data to support it.
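To put a number on this rather than reading it off the bars, the placement rate within each gender can be computed with a row-normalised contingency table. A small sketch on made-up values (the vectors `gender` and `placed` are hypothetical stand-ins for `data$Gender` and `data$PlacedOrNot`):

```r
# Made-up values for illustration only.
gender <- factor(c("Male", "Male", "Male", "Male", "Female", "Female"))
placed <- factor(c(1, 1, 1, 0, 1, 0))

# Cross-tabulate gender against the outcome.
counts <- table(Gender = gender, PlacedOrNot = placed)

# margin = 1 normalises each row, so every row sums to 1.
rates <- prop.table(counts, margin = 1)
print(counts)
print(round(rates, 2))
```

Printing `counts` next to `rates` keeps the group sizes in view, which matters here because the two genders are sampled so unevenly.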
plot_stream <- ggplot(data = data, aes(Stream, fill = PlacedOrNot))
plot_stream + geom_bar(color = "black", position = "dodge") + scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) + coord_flip()
The plot indicates the highest number of placements in Computer Science, followed by Information Technology.
plot_int <- ggplot(data = data, aes(Internships))
plot_int + geom_bar(color = "black", position = "dodge")
Visualizing the placed or not placed outcome by number of internships:
plot_int <- ggplot(data = data, aes(Internships, fill = PlacedOrNot))
plot_int + geom_bar(color = "black", position = "fill")
Here the fill-type bar graph shows that the success ratio is higher for students with more internships. This is not a solid result, however, because very few students have more than one internship.
Count:
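The caveat about small groups can be checked by listing the group size next to each placement rate. A sketch on made-up values (`internships` and `placed` are hypothetical stand-ins for the real columns; `placed` is kept numeric 0/1 here so that `mean()` yields a rate):

```r
# Made-up values for illustration only.
internships <- c(0, 0, 0, 0, 1, 1, 1, 2, 3)
placed      <- c(0, 1, 0, 0, 1, 0, 1, 1, 1)

# Group size and placement rate for each number of internships.
n    <- tapply(placed, internships, length)
rate <- tapply(placed, internships, mean)

summary_tab <- data.frame(
  internships = names(n),
  n = as.vector(n),
  rate = as.vector(rate)
)
print(summary_tab)
```

In this toy example the groups with two and three internships contain one student each, so their rate of 1 says very little. The same pattern is what makes the right-hand bars of a fill plot hard to trust.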
plot_gpa <- ggplot(data = data, aes(CGPA))
plot_gpa + geom_bar()
Visualizing the placed or not placed outcome by CGPA:
plot_gpa <- ggplot(data = data, aes(CGPA, fill = PlacedOrNot))
plot_gpa + geom_bar(color = "black", position = "fill")
plot_hostel <- ggplot(data = data, aes(Hostel, fill = PlacedOrNot))
plot_hostel + geom_bar(color = "black", position = "fill")
In my opinion, the Hostel variable does not have a large direct influence on the outcome; rather, it influences other features such as CGPA. As CGPA is already included in the features, I do not plan to add the Hostel feature to the model, but I will perform some further statistical tests before ruling it out completely.
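The text does not say which statistical tests are meant. One common choice for two categorical variables is a chi-squared test of independence, sketched here on a made-up 2x2 table of counts (the numbers are hypothetical, not taken from the data):

```r
# Hypothetical counts: rows are Hostel (0/1), columns are PlacedOrNot (0/1).
counts <- matrix(
  c(40, 60,
    45, 55),
  nrow = 2, byrow = TRUE,
  dimnames = list(Hostel = c("0", "1"), PlacedOrNot = c("0", "1"))
)

# Chi-squared test of independence (Yates' continuity correction is the default for 2x2 tables).
test <- chisq.test(counts)
print(test$statistic)
print(test$p.value)
```

On the real data the table would come from `table(data$Hostel, data$PlacedOrNot)`. A large p-value would be consistent with dropping Hostel, while a small one would suggest keeping it under consideration.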
plot_back <- ggplot(data = data, aes(HistoryOfBacklogs, fill = PlacedOrNot))
plot_back + geom_bar(color = "black", position = "fill")
For feature importance we need to make some changes to the data: the factor variables are converted to numeric, and the string-valued categorical variables (Gender and Stream) are unclassed to their integer codes.
Here I have just trained a basic random forest to get the feature importance.
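# Work on a copy so the factor-coded data stays available for plotting.
data3 <- data
# String-valued factors: unclass() exposes the integer level codes.
data3$Gender <- as.numeric(unclass(data3$Gender))
data3$Stream <- as.numeric(unclass(data3$Stream))
# 0/1 factors: as.numeric() on a factor returns its level codes (1/2 here, not 0/1).
data3$Hostel <- as.numeric(data3$Hostel)
data3$HistoryOfBacklogs <- as.numeric(data3$HistoryOfBacklogs)
data3$PlacedOrNot <- as.numeric(data3$PlacedOrNot)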
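# 10-fold cross-validation, repeated 3 times.
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# Random forest on all features; PlacedOrNot is numeric here, so caret fits a regression forest.
model <- train(PlacedOrNot~., data=data3, method="rf", trControl=control)
# Unscaled variable importance.
importance <- varImp(model, scale=FALSE)
plot(importance)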
From the feature importance plot we can see that the most important features for landing a placement are CGPA, Age, Internships and Stream. The other variables, such as HistoryOfBacklogs, Hostel and Gender, do not explain the data as much.
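One detail of the data preparation for the random forest is easy to miss: calling `as.numeric()` on a factor returns the internal level codes, not the original labels. The sketch below shows the difference on a made-up 0/1 factor `f` (a hypothetical stand-in for a column such as Hostel after `as.factor`):

```r
# Made-up 0/1 factor for illustration.
f <- factor(c("0", "1", "1", "0"))

codes  <- as.numeric(f)                # level codes: 1 2 2 1
values <- as.numeric(as.character(f))  # original values: 0 1 1 0

print(codes)
print(values)
```

For a tree-based model the shift from 0/1 to 1/2 does not change how the variable splits, so the importance ranking should not depend on it. Going through `as.character()` is the safer habit whenever the numeric values themselves matter.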