#Satya Narayana Panda
The Analysis Case: Selection of Students for the MBA Program
Jain University was promoted by the Jain University Trust, which was managed by the Jain Group of Institutions (JGI). Headquartered at Bangalore, India, JGI represented a cluster of 85 vibrant educational establishments. Easwaran Iyer, the Dean of Jain University’s Business School, wanted to ensure that they admitted the right set of students to their Master of Business Administration (MBA) program, but he was not sure about the parameters that could be used to identify students who were ideal for this program. Jain University received applications for the MBA program from across India and admitted approximately 400 students to this program every year. There had been a steady increase in the number of applications received by Jain University over the years. The University had reached a stage where it could be very selective in choosing the students for the MBA program. At the beginning of the admissions season (which began in April and stretched till July), Iyer was faced with the same question every year: “Whom to admit?” As Dean, Iyer conducted himself as a “guard” by thoroughly screening the admission-seeking candidates and deciding which candidates to admit or reject. Although there was no penalty if a non-placeable student was selected, it would weigh heavily on the institute’s reputation. A wrong pick could eventually contribute toward an increase in the number of unplaced students as well as a reduction in the median salary. Moreover, there was the possibility of rejecting a placeable candidate. What made Iyer’s job tougher was that he was expected to increase the batch size while also increasing the quality of the admitted set of students. He acknowledged that the MBA admissions process needed much more analytical reasoning, taking multiple criteria into consideration.
The job of the Admissions Committee was to admit the best candidates from among the available lot. The committee considered an applicant’s academic grades and discipline, entrance test score, and work experience (if any). It also judged candidates by their performance in the personal interview. Similar to most MBA admissions committees, the team had the mandate to build a diverse batch. Their objective was to put together a group of different yet similar candidates. The admissions team wanted to understand whether a student’s academic record would have any reflection on the placement status. What could be the possible criteria for assessing/selecting a student who could get placed in a good role, with a good pay packet?
Campus placements marked the final phase of a student’s life at a B-school. The placement team at Jain University had been able to achieve a placement rate of approximately 80% in the past. The team was of the opinion that the remaining students would eventually get placed; however, this process would take longer and might not necessarily happen through campus placement initiatives. As a result, these 20% students might have been a wrong choice at the time of admission.
Patchy placements were the bane of many B-schools in India. The Admissions Committee collected data of the students who had been admitted in 2011 and were placed in 2013. The data (in Excel file named MBA.xlsx) are provided in the supplementary document.
I begin the data and case analysis with the first set of questions, using R and its libraries to explore and understand the data, draw insights from it, and prepare training and test sets for linear regression and other models.
Exploratory data analysis is the process of exploring your data, and it typically includes examining the structure and components of your dataset, the distributions of individual variables, and the relationships between two or more variables. There are several goals of exploratory data analysis, which are:
- To determine if there are any problems (missing values, typos, etc.) with your dataset.
- To determine whether the question you are asking can be answered by the data that you have.
It is a good idea to run the str() function first on the dataset. This is usually a safe operation in the sense that even with a very large dataset, running str() shouldn’t take too long. It is also useful to look at the “beginning” and “end” of a dataset. This lets us know if the data were read in properly, things are properly formatted, and that everything is there.
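As a minimal sketch of this first look, assuming a small made-up data frame (`toy`, which is not the MBA data), the structure and both ends of a dataset can be inspected with base R:

```r
# Hypothetical toy data standing in for a real dataset (not MBA.xlsx)
toy <- data.frame(
  id    = 1:6,
  score = c(62, 76.3, 72, 60, 61, NA),
  board = c("Others", "ICSE", "Others", "CBSE", "CBSE", "ICSE"),
  stringsAsFactors = FALSE
)

str(toy)                     # column types and a preview of the values

first_rows <- head(toy, 3)   # beginning: was the header read correctly?
last_rows  <- tail(toy, 3)   # end: any truncated or stray footer rows?

print(first_rows)
print(last_rows)
```

Looking at both ends is cheap and tends to surface problems such as a missing value in the last row, which `str()` alone may not reveal.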
Based on the paragraph above, explore the dataset MBA.xlsx. List and explain any issues with this dataset. DO NOT correct anything in the dataset.
This dataset (MBA.xlsx) contains information about MBA students, including their academic performance, entrance test scores, specialization, placement status, and salary.
R code to explore the dataset and identify potential issues -
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Reading the dataset and exploring the structure of the data
library(readxl) # provides read_excel()
mbadata <- read_excel("MBA.xlsx")
str(mbadata)
## tibble [391 × 26] (S3: tbl_df/tbl/data.frame)
## $ RegNo : num [1:391] 1 2 3 4 5 6 7 8 9 10 ...
## $ Gender : chr [1:391] "M" "M" "M" "M" ...
## $ Gender-B : num [1:391] 0 0 0 0 0 0 1 0 0 1 ...
## $ Percent_SSC : num [1:391] 62 76.3 72 60 61 ...
## $ Board_SSC : chr [1:391] "Others" "ICSE" "Others" "CBSE" ...
## $ Board_CBSE : num [1:391] 0 0 0 1 1 0 0 0 1 1 ...
## $ Board_ICSE : num [1:391] 0 1 0 0 0 1 0 1 0 0 ...
## $ Percent_HSC : num [1:391] 88 75.3 78 63 55 ...
## $ Board_HSC : chr [1:391] "Others" "Others" "Others" "CBSE" ...
## $ Stream_HSC : chr [1:391] "Commerce" "Science" "Commerce" "Arts" ...
## $ Percent_Degree : num [1:391] 52 75.5 66.6 58 54 ...
## $ Course_Degree : chr [1:391] "Science" "Computer Applications" "Engineering" "Management" ...
## $ Degree_Engg : num [1:391] 0 0 1 0 1 0 0 0 0 0 ...
## $ Experience_Yrs : num [1:391] 0 1 0 0 1 0 2 0 0 1 ...
## $ Entrance_Test : chr [1:391] "MAT" "MAT" NA "MAT" ...
## $ S-TEST : num [1:391] 1 1 0 1 1 0 0 1 1 0 ...
## $ Percentile_ET : num [1:391] 55 86.5 0 75 66 ...
## $ Percent_MBA : num [1:391] 58.8 66.3 52.9 57.8 59.4 ...
## $ S-TEST*SCORE : num [1:391] 55 86.5 0 75 66 ...
## $ Specialization_MBA : chr [1:391] "Marketing & HR" "Marketing & Finance" "Marketing & Finance" "Marketing & Finance" ...
## $ Marks_Communication: num [1:391] 50 69 50 54 52 53 63 74 65 50 ...
## $ Marks_Projectwork : num [1:391] 65 70 61 66 65 70 56 72 76 59 ...
## $ Marks_BOCA : num [1:391] 74 75 59 62 67 53 50 50 70 77 ...
## $ Placement : chr [1:391] "Placed" "Placed" "Placed" "Placed" ...
## $ Placement_B : num [1:391] 1 1 1 1 1 1 1 1 1 1 ...
## $ Salary : num [1:391] 270000 200000 240000 250000 180000 300000 260000 235000 425000 240000 ...
## RegNo Gender Gender-B Percent_SSC
## Min. : 1.0 Length:391 Min. :0.0000 Min. :37.00
## 1st Qu.: 98.5 Class :character 1st Qu.:0.0000 1st Qu.:56.00
## Median :196.0 Mode :character Median :0.0000 Median :64.50
## Mean :196.0 Mean :0.3248 Mean :64.65
## 3rd Qu.:293.5 3rd Qu.:1.0000 3rd Qu.:74.00
## Max. :391.0 Max. :1.0000 Max. :87.20
## Board_SSC Board_CBSE Board_ICSE Percent_HSC
## Length:391 Min. :0.000 Min. :0.0000 Min. :40.0
## Class :character 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:54.0
## Mode :character Median :0.000 Median :0.0000 Median :63.0
## Mean :0.289 Mean :0.1969 Mean :63.8
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:72.0
## Max. :1.000 Max. :1.0000 Max. :94.7
## Board_HSC Stream_HSC Percent_Degree Course_Degree
## Length:391 Length:391 Min. :35.00 Length:391
## Class :character Class :character 1st Qu.:57.52 Class :character
## Mode :character Mode :character Median :63.00 Mode :character
## Mean :62.98
## 3rd Qu.:69.00
## Max. :89.00
## Degree_Engg Experience_Yrs Entrance_Test S-TEST
## Min. :0.00000 Min. :0.0000 Length:391 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character 1st Qu.:1.0000
## Median :0.00000 Median :0.0000 Mode :character Median :1.0000
## Mean :0.09463 Mean :0.4783 Mean :0.8286
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :3.0000 Max. :1.0000
## Percentile_ET Percent_MBA S-TEST*SCORE Specialization_MBA
## Min. : 0.00 Min. :50.83 Min. : 0.00 Length:391
## 1st Qu.:41.19 1st Qu.:57.20 1st Qu.:41.19 Class :character
## Median :62.00 Median :61.01 Median :62.00 Mode :character
## Mean :54.93 Mean :61.67 Mean :54.93
## 3rd Qu.:78.00 3rd Qu.:66.02 3rd Qu.:78.00
## Max. :98.69 Max. :77.89 Max. :98.69
## Marks_Communication Marks_Projectwork Marks_BOCA Placement
## Min. :50.00 Min. :50.00 Min. :50.00 Length:391
## 1st Qu.:53.00 1st Qu.:64.00 1st Qu.:57.00 Class :character
## Median :58.00 Median :69.00 Median :63.00 Mode :character
## Mean :60.54 Mean :68.36 Mean :64.38
## 3rd Qu.:67.00 3rd Qu.:74.00 3rd Qu.:72.50
## Max. :88.00 Max. :87.00 Max. :96.00
## Placement_B Salary
## Min. :0.000 Min. : 0
## 1st Qu.:1.000 1st Qu.:172800
## Median :1.000 Median :240000
## Mean :0.798 Mean :219078
## 3rd Qu.:1.000 3rd Qu.:300000
## Max. :1.000 Max. :940000
## RegNo Gender Gender-B Percent_SSC
## 0 0 0 0
## Board_SSC Board_CBSE Board_ICSE Percent_HSC
## 0 0 0 0
## Board_HSC Stream_HSC Percent_Degree Course_Degree
## 0 0 0 0
## Degree_Engg Experience_Yrs Entrance_Test S-TEST
## 0 0 67 0
## Percentile_ET Percent_MBA S-TEST*SCORE Specialization_MBA
## 0 0 0 0
## Marks_Communication Marks_Projectwork Marks_BOCA Placement
## 0 0 0 0
## Placement_B Salary
## 0 0
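Per-column missing-value counts like those above are typically produced with `colSums(is.na(...))`. The sketch below uses a hypothetical toy data frame (`toy`), not MBA.xlsx. It also checks whether a percentile of 0 coincides with a missing test name, a pattern suggested by the first rows of the `str()` output (row 3 has `Entrance_Test` = NA and `Percentile_ET` = 0).

```r
# Hypothetical toy data; column names mirror the dataset, values are made up
toy <- data.frame(
  Entrance_Test = c("MAT", "MAT", NA, "CAT", NA),
  Percentile_ET = c(55, 86.5, 0, 75, 0),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows where the test name is missing AND the percentile is recorded as 0
zero_with_na <- sum(is.na(toy$Entrance_Test) & toy$Percentile_ET == 0)
print(zero_with_na)
```

If the same pattern holds in the full data, a zero percentile would stand for "no test taken" rather than a true score of 0, which is worth listing as a data issue.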
# 4. Checking for the unique values in categorical columns (to detect typos)
categorical_cols <- c("Gender", "Board_SSC", "Board_HSC", "Stream_HSC",
"Course_Degree", "Entrance_Test", "Specialization_MBA", "Placement")
for (col in categorical_cols) {
print(paste("Unique values in", col, ":"))
print(unique(mbadata[[col]]))
}
## [1] "Unique values in Gender :"
## [1] "M" "F"
## [1] "Unique values in Board_SSC :"
## [1] "Others" "ICSE" "CBSE"
## [1] "Unique values in Board_HSC :"
## [1] "Others" "CBSE" "ISC"
## [1] "Unique values in Stream_HSC :"
## [1] "Commerce" "Science" "Arts"
## [1] "Unique values in Course_Degree :"
## [1] "Science" "Computer Applications" "Engineering"
## [4] "Management" "Commerce" "Others"
## [7] "Arts"
## [1] "Unique values in Entrance_Test :"
## [1] "MAT" NA "K-MAT" "CAT" "PGCET" "GCET" "G-MAT" "XAT" "G-SAT"
## [1] "Unique values in Specialization_MBA :"
## [1] "Marketing & HR" "Marketing & Finance" "Marketing & IB"
## [1] "Unique values in Placement :"
## [1] "Placed" "Not Placed"
# 5. Checking for outliers in numerical columns
numeric_cols <- c("Percent_SSC", "Percent_HSC", "Percent_Degree", "Percent_MBA", "Salary")
par(mfrow=c(2,3)) # Setting the layout for multiple plots
for (col in numeric_cols) {
boxplot(mbadata[[col]], main=col, col="lightblue", horizontal=TRUE)
}
# 6. Checking for incorrect salary assignments (should be NA or 0 for unplaced students)
mbadata %>%
filter(Placement != "Placed") %>%
select(Placement, Salary)
## # A tibble: 79 × 2
## Placement Salary
## <chr> <dbl>
## 1 Not Placed 0
## 2 Not Placed 0
## 3 Not Placed 0
## 4 Not Placed 0
## 5 Not Placed 0
## 6 Not Placed 0
## 7 Not Placed 0
## 8 Not Placed 0
## 9 Not Placed 0
## 10 Not Placed 0
## # ℹ 69 more rows
# 7. Checking for values outside expected ranges (e.g., percentages should be 0-100)
mbadata %>%
filter(Percent_SSC > 100 | Percent_SSC < 0 |
Percent_HSC > 100 | Percent_HSC < 0 |
Percent_Degree > 100 | Percent_Degree < 0 |
Percent_MBA > 100 | Percent_MBA < 0)
## # A tibble: 0 × 26
## # ℹ 26 variables: RegNo <dbl>, Gender <chr>, Gender-B <dbl>, Percent_SSC <dbl>,
## # Board_SSC <chr>, Board_CBSE <dbl>, Board_ICSE <dbl>, Percent_HSC <dbl>,
## # Board_HSC <chr>, Stream_HSC <chr>, Percent_Degree <dbl>,
## # Course_Degree <chr>, Degree_Engg <dbl>, Experience_Yrs <dbl>,
## # Entrance_Test <chr>, S-TEST <dbl>, Percentile_ET <dbl>, Percent_MBA <dbl>,
## # S-TEST*SCORE <dbl>, Specialization_MBA <chr>, Marks_Communication <dbl>,
## # Marks_Projectwork <dbl>, Marks_BOCA <dbl>, Placement <chr>, …
# Loading necessary library
library(ggplot2)
# Defining function to plot boxplots with highlighted outliers
plot_box_with_outliers <- function(column_name) {
ggplot(mbadata, aes(x = "", y = .data[[column_name]])) +
geom_boxplot(outlier.color = "red", outlier.shape = 16, outlier.size = 3, fill = "lightblue") +
labs(title = paste("Boxplot of", column_name), y = column_name) +
theme_minimal()
}
# Example: Plot for Salary column
plot_box_with_outliers("Salary")
numeric_cols <- c("Percent_SSC", "Percent_HSC", "Percent_Degree", "Percent_MBA", "Salary")
# Create and display box plots for each column
for (col in numeric_cols) {
print(plot_box_with_outliers(col))
}
Review the variables (i.e. columns) in the dataset.
Answer -
“Can we predict whether an MBA student will be placed based on their academic performance, entrance test scores, and communication skills?”
Variables to be used:
Target variable (dependent): Placement_B (1 = Placed, 0 = Not Placed)
Predictor variables (independent):
- Percent_SSC (Secondary School Percentage)
- Percent_HSC (High School Percentage)
- Percent_Degree (Undergraduate Degree Percentage)
- Percentile_ET (Entrance Test Percentile)
- Marks_Communication (Communication Marks)
- Marks_Projectwork (Project Work Marks)
- Marks_BOCA (Board of Communication Assessment)
- Specialization_MBA (MBA Major)
- Experience_Yrs (Work Experience in Years)
Justification:
This question is relevant because it assesses whether academic and soft skills influence placement chances. A classification model (e.g., logistic regression, decision tree, or random forest) can be used to predict placement outcomes.
Answer - “Can we predict the salary of a student who is placed based on their MBA specialization and past academic performance?”
Variables to be used:
Target variable (dependent): Salary (annual salary of placed students)
Predictor variables (independent):
- Percent_SSC (Secondary School Percentage)
- Percent_HSC (High School Percentage)
- Percent_Degree (Undergraduate Degree Percentage)
- Percentile_ET (Entrance Test Percentile)
- Marks_Communication (Communication Marks)
- Marks_Projectwork (Project Work Marks)
- Marks_BOCA (Board of Communication Assessment)
- Specialization_MBA (Marketing & HR, Marketing & Finance, Marketing & IB)
- Experience_Yrs (Work Experience in Years)
Justification: While MBA salary predictions are common, using Specialization + Past Academic Scores as predictors is an unusual take. It assumes that a student’s past academic performance and MBA focus area influence their starting salary, which can be modeled using linear regression or other regression models.
Answer - (a) Predicting MBA Placement Status
Question: Can we predict whether an MBA student will be placed based on their academic performance, entrance test scores, and communication skills?
Methods:
- Logistic Regression: Since Placement_B is a binary variable (1 = Placed, 0 = Not Placed), logistic regression is a good starting model to estimate the probability of placement based on predictor variables.
- Decision Trees / Random Forest: These models handle categorical and numerical data well and can capture complex relationships between placement and influencing factors. Random forests provide better generalization by reducing overfitting.
- Support Vector Machine (SVM): An SVM with a suitable kernel can classify students based on their academic and skill-based features.
- Neural Networks: If the dataset is large enough, a deep learning model can be trained to capture non-linear relationships.
Why?
The problem is a classification task (binary outcome). Logistic regression is interpretable and a good baseline. Decision trees/random forests help capture non-linear relationships and interactions among predictors. SVM and neural networks provide alternative solutions for improved accuracy.
(b) Predicting MBA Salary
Question: Can we predict the salary of a student who is placed based on their MBA specialization and past academic performance?
Methods:
Linear Regression: Since Salary is a continuous variable, a simple multiple linear regression model can predict it based on academic performance and specialization.
Decision Trees / Random Forest Regression: These methods can model non-linear relationships and interactions in the data better than linear regression.
Gradient Boosting (XGBoost, LightGBM): These models provide high accuracy for structured data, handling missing values and feature importance well.
Neural Networks: A deep learning regression model could be used if the dataset is large enough, capturing complex salary determinants.
Why?
The problem is a regression task (continuous numerical outcome). Linear regression is a simple baseline model. Decision trees and boosting methods can handle categorical variables (e.g., Specialization_MBA) and non-linearity. Neural networks may improve predictions if the dataset is large and complex. This caveat matters: neural networks need far more data than the 391 records available here.
Exploring each algorithm with R code
Logistic Regression
# Install caret package if not already installed
if(!require(caret)) {
install.packages("caret")
library(caret)
}
## Loading required package: caret
## Loading required package: lattice
# Convert Placement_B to a factor
mbadata$Placement_B <- as.factor(mbadata$Placement_B)
# Splitting the dataset
set.seed(123)
trainIndex <- createDataPartition(mbadata$Placement_B, p = 0.8, list = FALSE)
train_data <- mbadata[trainIndex, ]
test_data <- mbadata[-trainIndex, ]
# Fitting Logistic Regression Model
logit_model <- glm(Placement_B ~ Percent_SSC + Percent_HSC + Percent_Degree +
Percentile_ET + Marks_Communication + Marks_Projectwork,
data = train_data, family = binomial)
# Model summary
summary(logit_model)
##
## Call:
## glm(formula = Placement_B ~ Percent_SSC + Percent_HSC + Percent_Degree +
## Percentile_ET + Marks_Communication + Marks_Projectwork,
## family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.800415 1.677193 -1.073 0.2831
## Percent_SSC 0.053036 0.017397 3.048 0.0023 **
## Percent_HSC -0.025486 0.015208 -1.676 0.0938 .
## Percent_Degree 0.015877 0.019588 0.811 0.4176
## Percentile_ET 0.010388 0.004714 2.203 0.0276 *
## Marks_Communication -0.052428 0.021120 -2.482 0.0131 *
## Marks_Projectwork 0.045520 0.022190 2.051 0.0402 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 317.55 on 313 degrees of freedom
## Residual deviance: 294.06 on 307 degrees of freedom
## AIC: 308.06
##
## Number of Fisher Scoring iterations: 4
# Predicting on the test data
pred_probs <- predict(logit_model, test_data, type = "response")
pred_classes <- ifelse(pred_probs > 0.5, 1, 0)
# Evaluating the model accuracy using the confusion matrix way
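# Fixing the factor levels keeps both classes present even if only one is predicted
confusionMatrix(factor(pred_classes, levels = c(0, 1)), test_data$Placement_B)
## Confusion Matrix and Statistics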
##
## Reference
## Prediction 0 1
## 0 0 1
## 1 15 61
##
## Accuracy : 0.7922
## 95% CI : (0.6846, 0.8763)
## No Information Rate : 0.8052
## P-Value [Acc > NIR] : 0.675485
##
## Kappa : -0.025
##
## Mcnemar's Test P-Value : 0.001154
##
## Sensitivity : 0.00000
## Specificity : 0.98387
## Pos Pred Value : 0.00000
## Neg Pred Value : 0.80263
## Prevalence : 0.19481
## Detection Rate : 0.00000
## Detection Prevalence : 0.01299
## Balanced Accuracy : 0.49194
##
## 'Positive' Class : 0
##
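To make the statistics above easier to read, the sketch below recomputes them in base R from the four counts in the confusion matrix. The counts are copied from the output above. The variable names are illustrative, and class 0 (Not Placed) is treated as the positive class, as caret reports.

```r
# Counts copied from the logistic regression confusion matrix above
# (rows = predicted class, columns = actual class); matrix() fills by column
cm <- matrix(c(0, 15, 1, 61), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)            # correct predictions / all cases
nir         <- max(colSums(cm)) / sum(cm)         # always predicting the majority class
sensitivity <- cm["0", "0"] / sum(cm[, "0"])      # unplaced students correctly flagged
specificity <- cm["1", "1"] / sum(cm[, "1"])      # placed students correctly flagged
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, nir = nir, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```

Accuracy falls just below the no-information rate, so at the 0.5 threshold this model does no better than labelling every student "Placed", and it identifies none of the 15 unplaced students in the test set.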
Random Forest for Placement Prediction
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
# Convert Placement_B to a factor
mbadata$Placement_B <- as.factor(mbadata$Placement_B)
# Check class distribution
table(mbadata$Placement_B)
##
## 0 1
## 79 312
# Split the dataset using stratified sampling
set.seed(123)
trainIndex <- createDataPartition(mbadata$Placement_B, p = 0.8, list = FALSE)
train_data <- mbadata[trainIndex, ]
test_data <- mbadata[-trainIndex, ]
# Ensure Placement_B is a factor in train and test sets
train_data$Placement_B <- as.factor(train_data$Placement_B)
test_data$Placement_B <- as.factor(test_data$Placement_B)
# Check distribution again (training set, then test set)
table(train_data$Placement_B)
table(test_data$Placement_B)
##
## 0 1
## 64 250
##
## 0 1
## 15 62
# Train the Random Forest Model
rf_model <- randomForest(Placement_B ~ Percent_SSC + Percent_HSC + Percent_Degree +
Percentile_ET + Marks_Communication + Marks_Projectwork,
data = train_data, ntree = 100, mtry = 3, importance = TRUE)
# Model summary
print(rf_model)
##
## Call:
## randomForest(formula = Placement_B ~ Percent_SSC + Percent_HSC + Percent_Degree + Percentile_ET + Marks_Communication + Marks_Projectwork, data = train_data, ntree = 100, mtry = 3, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 21.97%
## Confusion matrix:
## 0 1 class.error
## 0 8 56 0.875
## 1 13 237 0.052
# Predictions
rf_preds <- predict(rf_model, test_data)
# Evaluate model accuracy
confusionMatrix(rf_preds, test_data$Placement_B)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 3
## 1 13 59
##
## Accuracy : 0.7922
## 95% CI : (0.6846, 0.8763)
## No Information Rate : 0.8052
## P-Value [Acc > NIR] : 0.67548
##
## Kappa : 0.1137
##
## Mcnemar's Test P-Value : 0.02445
##
## Sensitivity : 0.13333
## Specificity : 0.95161
## Pos Pred Value : 0.40000
## Neg Pred Value : 0.81944
## Prevalence : 0.19481
## Detection Rate : 0.02597
## Detection Prevalence : 0.06494
## Balanced Accuracy : 0.54247
##
## 'Positive' Class : 0
##
Multiple Linear Regression
# Filtering only placed students
placed_data <- filter(mbadata, Placement_B == 1)
# Splitting the dataset
set.seed(123)
trainIndex <- createDataPartition(placed_data$Salary, p = 0.8, list = FALSE)
train_data <- placed_data[trainIndex, ]
test_data <- placed_data[-trainIndex, ]
# Fitting Linear Regression Model
lm_model <- lm(Salary ~ Percent_SSC + Percent_HSC + Percent_Degree +
Percentile_ET + Marks_Communication + Marks_Projectwork + Specialization_MBA,
data = train_data)
# Model summary
summary(lm_model)
##
## Call:
## lm(formula = Salary ~ Percent_SSC + Percent_HSC + Percent_Degree +
## Percentile_ET + Marks_Communication + Marks_Projectwork +
## Specialization_MBA, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -170910 -52775 -10091 29663 644255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 161819.02 70955.61 2.281 0.023444 *
## Percent_SSC -218.10 669.42 -0.326 0.744855
## Percent_HSC -38.44 618.04 -0.062 0.950459
## Percent_Degree 323.89 773.61 0.419 0.675825
## Percentile_ET 247.96 199.86 1.241 0.215917
## Marks_Communication 2162.55 803.38 2.692 0.007602 **
## Marks_Projectwork -230.08 884.54 -0.260 0.794995
## Specialization_MBAMarketing & HR -44458.41 12611.19 -3.525 0.000506 ***
## Specialization_MBAMarketing & IB -13604.26 34337.60 -0.396 0.692313
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 92460 on 242 degrees of freedom
## Multiple R-squared: 0.1071, Adjusted R-squared: 0.07758
## F-statistic: 3.628 on 8 and 242 DF, p-value: 0.0005242
# Predicting on the test data
lm_preds <- predict(lm_model, test_data)
# Evaluating with the RMSE
rmse <- sqrt(mean((lm_preds - test_data$Salary)^2))
print(paste("RMSE:", round(rmse, 2)))
## [1] "RMSE: 85999.45"
#Random Forest Regression
# Training the Random Forest Regression Model
rf_reg_model <- randomForest(Salary ~ Percent_SSC + Percent_HSC + Percent_Degree +
Percentile_ET + Marks_Communication + Marks_Projectwork + Specialization_MBA,
data = train_data, ntree = 100, mtry = 3, importance = TRUE)
# Model summary
print(rf_reg_model)
##
## Call:
## randomForest(formula = Salary ~ Percent_SSC + Percent_HSC + Percent_Degree + Percentile_ET + Marks_Communication + Marks_Projectwork + Specialization_MBA, data = train_data, ntree = 100, mtry = 3, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 9812739266
## % Var explained: -6.3
# Predicting the Salary
rf_reg_preds <- predict(rf_reg_model, test_data)
# Evaluating the model performance
rmse_rf <- sqrt(mean((rf_reg_preds - test_data$Salary)^2))
print(paste("RMSE (Random Forest):", round(rmse_rf, 2)))
## [1] "RMSE (Random Forest): 82461.49"
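An RMSE value is easier to judge against a naive baseline. The sketch below wraps the RMSE formula used above in a helper and compares a model's predictions with a constant mean prediction. The function name `rmse_of` and all salary figures are hypothetical, chosen only to illustrate the comparison.

```r
# Root mean squared error between actual and predicted values
rmse_of <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical salaries and predictions (not taken from MBA.xlsx)
actual        <- c(200000, 240000, 300000, 260000)
model_pred    <- c(220000, 250000, 270000, 255000)

# Naive baseline: predict the same mean salary for everyone.
# (In a real analysis the mean would come from the training set.)
baseline_pred <- rep(mean(actual), length(actual))

rmse_model    <- rmse_of(actual, model_pred)
rmse_baseline <- rmse_of(actual, baseline_pred)

print(c(model = rmse_model, baseline = rmse_baseline))
```

A model whose RMSE is not clearly below the baseline's adds little beyond predicting the average salary.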
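#Key takeaways from all the above results
Logistic regression and random forest predict whether a student gets placed; linear regression and random forest regression predict salary. On the held-out test set, both classifiers reach 79.2% accuracy, slightly below the 80.5% no-information rate, and both rarely identify unplaced students (sensitivity 0.00 for logistic regression, 0.13 for random forest). For salary, random forest has a somewhat lower test RMSE (82,461 vs. 85,999), but its out-of-bag variance explained is negative (-6.3%) and the linear model's R-squared is only 0.107, so neither model predicts salary well with these predictors.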
Apply the following procedures on the variables indicated.
a) Get a histogram or bar plot for each variable (except RegNo), depending on the type of variable, to show its distributional/frequency properties. Comment on the skewness of the distributions of the continuous (numerical) variables.
b) Get a boxplot for the Salary variable (on the y axis) and the Specialization_MBA variable (on the x axis). If you see any outliers, replace the two highest outliers with the median of the area variable.
c) Get a correlation matrix of the continuous (numerical) variables using the cor() function.
d) Create a scatter plot matrix of the continuous (numerical) variables using the plot() function.
e) If you were going to predict Salary, which two predictors would you suggest? Why?
The R code is provided below.
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:dplyr':
##
## combine
# Excluding the 'RegNo' as it's an identifier
num_vars <- mbadata %>% select(where(is.numeric)) %>% select(-RegNo)
cat_vars <- mbadata %>% select(where(is.character))
# Plotting the histograms for numerical variables
hist_plots <- lapply(names(num_vars), function(var) {
ggplot(mbadata, aes(x = .data[[var]])) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
ggtitle(paste("Histogram of", var))
})
# Plotting the bar charts for categorical variables
bar_plots <- lapply(names(cat_vars), function(var) {
ggplot(mbadata, aes(x = .data[[var]])) +
geom_bar(fill = "green", alpha = 0.7) +
ggtitle(paste("Bar Plot of", var))
})
# Displaying the histograms
grid.arrange(grobs = hist_plots, ncol = 2)
# Displaying the bar charts
grid.arrange(grobs = bar_plots, ncol = 2)
Comments on Skewness:
If the histogram is right-skewed (positive skew), the mean is typically greater than the median. If the histogram is left-skewed (negative skew), the median is typically greater than the mean. If the distribution is approximately symmetric, the mean and median are close. We can inspect the skewness visually or calculate it:
## Gender-B Percent_SSC Board_CBSE Board_ICSE
## 0.74819835 -0.06302759 0.93094136 1.52418667
## Percent_HSC Percent_Degree Degree_Engg Experience_Yrs
## 0.29013310 0.05247654 2.76985331 1.27290995
## S-TEST Percentile_ET Percent_MBA S-TEST*SCORE
## -1.74430818 -0.73886985 0.33977048 -0.73886985
## Marks_Communication Marks_Projectwork Marks_BOCA Salary
## 0.73860684 -0.25873771 0.29200068 0.23965130
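The function that produced the skewness values above is not shown. As a hedged sketch, a moment-based skewness can be computed in base R as below; the helper name `skewness_moment` and the sample vectors are assumptions, and its results may differ slightly from package implementations that apply a sample-size correction.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
skewness_moment <- function(x) {
  x  <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)       # hypothetical symmetric sample
right_skewed <- c(1, 1, 2, 2, 3, 10)   # hypothetical sample with a long right tail

skew_sym   <- skewness_moment(symmetric)
skew_right <- skewness_moment(right_skewed)

print(c(symmetric = skew_sym, right_skewed = skew_right))
```

Values near 0 indicate rough symmetry, positive values a longer right tail, and negative values a longer left tail. For 0/1 indicator columns such as Degree_Engg, skewness mostly reflects class imbalance rather than distribution shape.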
# Creating the boxplot
ggplot(mbadata, aes(x = Specialization_MBA, y = Salary)) +
geom_boxplot(fill = "orange", alpha = 0.7) +
ggtitle("Boxplot of Salary by Specialization") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Identifying the outliers
salary_outliers <- boxplot(mbadata$Salary, plot = FALSE)$out
top_outliers <- sort(salary_outliers, decreasing = TRUE)[1:2] # Top 2 outliers
# Replacing the top 2 outliers with median
salary_median <- median(mbadata$Salary, na.rm = TRUE)
mbadata$Salary <- ifelse(mbadata$Salary %in% top_outliers, salary_median, mbadata$Salary)
## Gender-B Percent_SSC Board_CBSE Board_ICSE
## Gender-B 1.000000000 0.16875292 0.003574313 0.068524087
## Percent_SSC 0.168752920 1.00000000 -0.100758669 0.033998762
## Board_CBSE 0.003574313 -0.10075867 1.000000000 -0.315716560
## Board_ICSE 0.068524087 0.03399876 -0.315716560 1.000000000
## Percent_HSC 0.143639755 0.39658510 -0.001983207 0.212188778
## Percent_Degree 0.199576030 0.41307152 0.017286269 0.066431310
## Degree_Engg -0.037650495 0.22692218 -0.051911581 -0.028265411
## Experience_Yrs 0.002138339 -0.01523697 -0.008836470 0.059594661
## S-TEST -0.017940380 0.08459661 0.005437720 -0.013749613
## Percentile_ET 0.014040797 0.21151671 0.043719577 0.028909312
## Percent_MBA 0.317448268 0.47563845 -0.090726031 0.096812471
## S-TEST*SCORE 0.014040797 0.21151671 0.043719577 0.028909312
## Marks_Communication 0.251887318 0.47627913 -0.104341706 0.146277347
## Marks_Projectwork 0.201185752 0.13249597 -0.007146218 0.030674247
## Marks_BOCA 0.232778140 0.27159726 0.007208459 0.004605852
## Salary -0.129494038 0.20513435 0.051576910 -0.004936281
## Percent_HSC Percent_Degree Degree_Engg Experience_Yrs
## Gender-B 0.143639755 0.19957603 -0.037650495 0.002138339
## Percent_SSC 0.396585098 0.41307152 0.226922181 -0.015236973
## Board_CBSE -0.001983207 0.01728627 -0.051911581 -0.008836470
## Board_ICSE 0.212188778 0.06643131 -0.028265411 0.059594661
## Percent_HSC 1.000000000 0.33894314 0.033327635 -0.042637940
## Percent_Degree 0.338943143 1.00000000 -0.036044381 -0.029147261
## Degree_Engg 0.033327635 -0.03604438 1.000000000 0.043335149
## Experience_Yrs -0.042637940 -0.02914726 0.043335149 1.000000000
## S-TEST 0.041399307 0.11931595 0.007887498 -0.070866220
## Percentile_ET 0.151457235 0.21312710 0.039164935 -0.009218927
## Percent_MBA 0.380494533 0.44713781 0.126913456 0.160725193
## S-TEST*SCORE 0.151457235 0.21312710 0.039164935 -0.009218927
## Marks_Communication 0.321431613 0.41271632 0.103146888 0.086718058
## Marks_Projectwork 0.160446002 0.19175554 0.047218540 0.142599070
## Marks_BOCA 0.156588633 0.26887591 0.074859402 0.172957193
## Salary 0.095792541 0.09852761 0.104377343 0.142547023
## S-TEST Percentile_ET Percent_MBA S-TEST*SCORE
## Gender-B -0.017940380 0.014040797 0.31744827 0.014040797
## Percent_SSC 0.084596610 0.211516711 0.47563845 0.211516711
## Board_CBSE 0.005437720 0.043719577 -0.09072603 0.043719577
## Board_ICSE -0.013749613 0.028909312 0.09681247 0.028909312
## Percent_HSC 0.041399307 0.151457235 0.38049453 0.151457235
## Percent_Degree 0.119315951 0.213127104 0.44713781 0.213127104
## Degree_Engg 0.007887498 0.039164935 0.12691346 0.039164935
## Experience_Yrs -0.070866220 -0.009218927 0.16072519 -0.009218927
## S-TEST 1.000000000 0.802522425 0.08378289 0.802522425
## Percentile_ET 0.802522425 1.000000000 0.21416061 1.000000000
## Percent_MBA 0.083782890 0.214160613 1.00000000 0.214160613
## S-TEST*SCORE 0.802522425 1.000000000 0.21416061 1.000000000
## Marks_Communication 0.101010089 0.200446535 0.70699926 0.200446535
## Marks_Projectwork 0.123962640 0.146226420 0.43555824 0.146226420
## Marks_BOCA 0.012311678 0.138223555 0.47673650 0.138223555
## Salary 0.037206855 0.150589486 0.17659425 0.150589486
## Marks_Communication Marks_Projectwork Marks_BOCA
## Gender-B 0.25188732 0.201185752 0.232778140
## Percent_SSC 0.47627913 0.132495968 0.271597259
## Board_CBSE -0.10434171 -0.007146218 0.007208459
## Board_ICSE 0.14627735 0.030674247 0.004605852
## Percent_HSC 0.32143161 0.160446002 0.156588633
## Percent_Degree 0.41271632 0.191755539 0.268875914
## Degree_Engg 0.10314689 0.047218540 0.074859402
## Experience_Yrs 0.08671806 0.142599070 0.172957193
## S-TEST 0.10101009 0.123962640 0.012311678
## Percentile_ET 0.20044653 0.146226420 0.138223555
## Percent_MBA 0.70699926 0.435558244 0.476736496
## S-TEST*SCORE 0.20044653 0.146226420 0.138223555
## Marks_Communication 1.00000000 0.308851397 0.210566765
## Marks_Projectwork 0.30885140 1.000000000 0.260200945
## Marks_BOCA 0.21056677 0.260200945 1.000000000
## Salary 0.12806145 0.155142138 0.134111997
## Salary
## Gender-B -0.129494038
## Percent_SSC 0.205134350
## Board_CBSE 0.051576910
## Board_ICSE -0.004936281
## Percent_HSC 0.095792541
## Percent_Degree 0.098527610
## Degree_Engg 0.104377343
## Experience_Yrs 0.142547023
## S-TEST 0.037206855
## Percentile_ET 0.150589486
## Percent_MBA 0.176594251
## S-TEST*SCORE 0.150589486
## Marks_Communication 0.128061448
## Marks_Projectwork 0.155142138
## Marks_BOCA 0.134111997
## Salary 1.000000000
## corrplot 0.95 loaded
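Answer - Best Two Predictors for Salary:
By absolute correlation with Salary, the two strongest predictors are Percent_SSC (r ≈ 0.205) and Percent_MBA (r ≈ 0.177), which is also what the code below returns. Percentile_ET (entrance test percentile, r ≈ 0.151) and Marks_Communication (communication skills score, r ≈ 0.128) are plausible drivers of job offers, but their correlations with Salary are weaker. All of these correlations are small, so no single variable explains much of the variation in Salary.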
salary_corr <- cor_matrix["Salary", ]
salary_corr <- sort(abs(salary_corr), decreasing = TRUE)
top_predictors <- names(salary_corr[2:3]) # excluding Salary itself, which sorts first with correlation 1
print(top_predictors)
## [1] "Percent_SSC" "Percent_MBA"
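The `[2:3]` indexing above relies on Salary sorting into first place. The following is a minimal sketch of a name-based alternative, using the Salary column of the correlation matrix printed above. The names `drop_cols`, `predictors`, `ranked`, and `top_two` are illustrative and not part of the original analysis. The sketch also drops S-TEST*SCORE, on the assumption that it duplicates Percentile_ET (their correlation in the matrix above is 1).

```r
# Salary column of the correlation matrix, copied from the output above
salary_corr <- c(
  "Gender-B"            = -0.129494038,
  "Percent_SSC"         =  0.205134350,
  "Board_CBSE"          =  0.051576910,
  "Board_ICSE"          = -0.004936281,
  "Percent_HSC"         =  0.095792541,
  "Percent_Degree"      =  0.098527610,
  "Degree_Engg"         =  0.104377343,
  "Experience_Yrs"      =  0.142547023,
  "S-TEST"              =  0.037206855,
  "Percentile_ET"       =  0.150589486,
  "Percent_MBA"         =  0.176594251,
  "S-TEST*SCORE"        =  0.150589486,
  "Marks_Communication" =  0.128061448,
  "Marks_Projectwork"   =  0.155142138,
  "Marks_BOCA"          =  0.134111997,
  "Salary"              =  1.000000000
)

# Remove the target itself and the assumed duplicate column by name
drop_cols  <- c("Salary", "S-TEST*SCORE")
predictors <- salary_corr[!(names(salary_corr) %in% drop_cols)]

# Rank by absolute correlation but keep the sign for reporting
ranked  <- predictors[order(abs(predictors), decreasing = TRUE)]
top_two <- names(ranked)[1:2]

print(round(ranked, 3))
print(top_two)
```

Keeping the signed values makes it visible that Gender-B is negatively correlated with Salary, which `abs()` alone would hide.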
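Right-skewed variables (e.g., Salary) indicate high-income outliers. The boxplot showed outliers in Salary, which we replaced with the median. The correlations between Salary and the individual predictors are weak: the largest, with Percent_SSC, is about 0.21, and those with Percentile_ET and Marks_Communication are about 0.15 and 0.13. Scatter plots of Salary against the key predictors are consistent with roughly linear but noisy relationships.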
===========================================================================================END of Question 4 =================
Answer -
To predict whether a student will be placed, we need independent variables that could influence placement outcomes. Common predictors include:
Percent_SSC (10th grade marks), Percent_HSC (12th grade marks), Percent_Degree, Specialization_MBA, Percentile_ET (entrance test percentile), Marks_Communication (communication skills), and Experience_Yrs (work experience in years). The target variable is Placement (Placed / Not Placed).
MBA marks and placement outcomes are only available after admission, so MBA marks should not be used as predictors for the admission decision. Instead, we can train two separate models. Admission Model: uses pre-admission parameters such as Percent_SSC, Percent_HSC, and Percentile_ET. Placement Model: uses all available parameters, including MBA performance and communication skills.
# Converting the Placement to a binary factor variable
mbadata$Placement <- as.factor(mbadata$Placement)
# Logistic regression with only Percent_SSC
logistic_model <- glm(mbadata$Placement ~ mbadata$Percent_SSC, data = mbadata, family = binomial)
# Model summary
summary(logistic_model)
##
## Call:
## glm(formula = mbadata$Placement ~ mbadata$Percent_SSC, family = binomial,
## data = mbadata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.10800 0.75176 -1.474 0.14052
## mbadata$Percent_SSC 0.03922 0.01196 3.280 0.00104 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 393.52 on 390 degrees of freedom
## Residual deviance: 382.31 on 389 degrees of freedom
## AIC: 386.31
##
## Number of Fisher Scoring iterations: 4
Observation from the output above -
Percent_SSC is a statistically significant predictor of placement (p = 0.00104). Its coefficient is positive (0.03922 on the log-odds scale), so a higher Percent_SSC raises the estimated probability of placement. The improvement in fit is modest, though: the deviance falls only from 393.52 (null) to 382.31 (residual). Model performance should be validated with metrics such as AUC (area under the ROC curve).
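To make the size of the Percent_SSC effect concrete, the log-odds coefficient can be converted to an odds ratio. This is a small worked calculation using only the estimate 0.03922 from the summary above; the variable names are illustrative.

```r
# Coefficient for Percent_SSC, copied from the model summary above
b_ssc <- 0.03922

# Odds ratio for a 1-point and a 10-point increase in Percent_SSC
or_1  <- exp(b_ssc)
or_10 <- exp(10 * b_ssc)

cat(sprintf("Odds ratio per 1 point:   %.3f\n", or_1))
cat(sprintf("Odds ratio per 10 points: %.3f\n", or_10))
```

Each additional percentage point in Percent_SSC multiplies the odds of placement by about 1.04, and a 10-point difference by roughly 1.48.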
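# Predicting the probability for students with Percent_SSC of 60% and 80%
# Note: the model formula refers to mbadata$Percent_SSC directly, so predict() cannot
# match any column of new_data and returns the fitted values for all 391 students instead.
new_data <- data.frame(SSC_Percentage = c(60, 80))
predicted_probs <- predict(logistic_model, new_data, type = "response")
## Warning: 'newdata' had 2 rows but variables found have 391 rows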
## 1 2 3 4 5 6 7 8
## 0.7897472 0.8682267 0.8475566 0.7764279 0.7831615 0.7405615 0.8371443 0.8261666
## 9 10 11 12 13 14 15 16
## 0.8946457 0.7695468 0.8822183 0.7625186 0.7764279 0.8146129 0.8953827 0.6928745
## 17 18 19 20 21 22 23 24
## 0.8261666 0.5880571 0.8797513 0.8146129 0.8284079 0.8371443 0.7695468 0.7961850
## 25 26 27 28 29 30 31 32
## 0.7011559 0.8317269 0.7173269 0.6928745 0.8061780 0.7011559 0.8574165 0.7625186
## 33 34 35 36 37 38 39 40
## 0.8204624 0.8621440 0.7764279 0.7405615 0.8755405 0.7018129 0.7480248 0.7764279
## 41 42 43 44 45 46 47 48
## 0.8261666 0.6131676 0.8717310 0.7173269 0.7329562 0.7173269 0.8667391 0.8074006
## 49 50 51 52 53 54 55 56
## 0.8593235 0.8597969 0.8550026 0.7595231 0.8261666 0.8317269 0.8146129 0.7818266
## 57 58 59 60 61 62 63 64
## 0.7625186 0.8562620 0.8371443 0.7695468 0.8621440 0.7480248 0.7173269 0.7252106
## 65 66 67 68 69 70 71 72
## 0.8989996 0.8204624 0.8024749 0.7329562 0.8939041 0.8667391 0.8371443 0.8755405
## 73 74 75 76 77 78 79 80
## 0.8621440 0.8916518 0.7173269 0.8989996 0.8574165 0.8989996 0.8792942 0.7405615
## 81 82 83 84 85 86 87 88
## 0.7625186 0.7252106 0.8485672 0.7509699 0.7723169 0.6672878 0.8086173 0.8621440
## 89 90 91 92 93 94 95 96
## 0.6486699 0.8712038 0.8424203 0.7435640 0.7625186 0.6178080 0.8261666 0.7405615
## 97 98 99 100 101 102 103 104
## 0.7961850 0.7625186 0.8317269 0.7011559 0.6549888 0.8317269 0.8731276 0.8885835
## 105 106 107 108 109 110 111 112
## 0.8972052 0.7286358 0.8797513 0.7011559 0.7212858 0.6802167 0.8545157 0.6496523
## 113 114 115 116 117 118 119 120
## 0.8927832 0.8146129 0.8261666 0.8712038 0.7695468 0.8086173 0.8838386 0.8751982
## 121 122 123 124 125 126 127 128
## 0.8742528 0.8371443 0.8525547 0.6496523 0.8375715 0.8797513 0.7480248 0.8851208
## 129 130 131 132 133 134 135 136
## 0.7405615 0.7285582 0.8475566 0.8911203 0.8525547 0.7736932 0.8755405 0.6895263
## 137 138 139 140 141 142 143 144
## 0.8024749 0.7864729 0.7791391 0.7695468 0.8878046 0.8712038 0.8024749 0.7831615
## 145 146 147 148 149 150 151 152
## 0.7173269 0.8785011 0.8261666 0.8878046 0.7405615 0.7695468 0.7625186 0.8086173
## 153 154 155 156 157 158 159 160
## 0.8871781 0.7329562 0.8061780 0.6672878 0.7831615 0.8878046 0.8916518 0.8785011
## 161 162 163 164 165 166 167 168
## 0.7553443 0.7341825 0.7422907 0.6844668 0.7695468 0.7405615 0.8651915 0.7011559
## 169 170 171 172 173 174 175 176
## 0.7695468 0.8261666 0.6180858 0.8525547 0.8776615 0.8024749 0.7695468 0.6844668
## 177 178 179 180 181 182 183 184
## 0.7625186 0.7173269 0.8574165 0.8086173 0.6759365 0.8371443 0.8024749 0.8024749
## 185 186 187 188 189 190 191 192
## 0.8150862 0.8712038 0.8797513 0.7480248 0.7695468 0.7275488 0.7405615 0.8371443
## 193 194 195 196 197 198 199 200
## 0.8055645 0.7480248 0.7011559 0.8315072 0.7764279 0.8371443 0.6844668 0.7974548
## 201 202 203 204 205 206 207 208
## 0.8574165 0.8525547 0.7236448 0.7961850 0.7695468 0.8424203 0.7286358 0.7864729
## 209 210 211 212 213 214 215 216
## 0.8317269 0.7897472 0.8878046 0.7517026 0.8667391 0.8204624 0.8729537 0.8997095
## 217 218 219 220 221 222 223 224
## 0.7329562 0.8559722 0.8141387 0.8371443 0.8525547 0.8366089 0.7897472 0.9091918
## 225 226 227 228 229 230 231 232
## 0.8261666 0.8851208 0.8595131 0.7480248 0.8204624 0.7011559 0.8024749 0.8163831
## 233 234 235 236 237 238 239 240
## 0.7897472 0.7681529 0.8797513 0.6995099 0.8574165 0.7764279 0.7764279 0.8574165
## 241 242 243 244 245 246 247 248
## 0.7831615 0.8916518 0.8605987 0.7329562 0.7695468 0.7553443 0.8244705 0.7405615
## 249 250 251 252 253 254 255 256
## 0.9072304 0.7329562 0.8086173 0.7329562 0.8574165 0.6759365 0.7625186 0.7897472
## 257 258 259 260 261 262 263 264
## 0.8795022 0.7334931 0.8146129 0.7764279 0.8495723 0.8838386 0.6224271 0.6585249
## 265 266 267 268 269 270 271 272
## 0.7897472 0.7522146 0.8885835 0.7864729 0.8621440 0.7553443 0.8989996 0.7831615
## 273 274 275 276 277 278 279 280
## 0.7405615 0.8475566 0.7804858 0.8328217 0.7982770 0.7625186 0.7907872 0.6672878
## 281 282 283 284 285 286 287 288
## 0.7764279 0.7695468 0.9098373 0.8278498 0.7468403 0.7329562 0.8662854 0.8712038
## 289 290 291 292 293 294 295 296
## 0.7093078 0.8083745 0.8444915 0.8621440 0.8712038 0.7764279 0.8261666 0.8755405
## 297 298 299 300 301 302 303 304
## 0.7405615 0.8243002 0.8878046 0.8384234 0.8261666 0.7093078 0.7804858 0.8371443
## 305 306 307 308 309 310 311 312
## 0.7961850 0.6844668 0.9052309 0.7764279 0.8424203 0.8204624 0.8204624 0.8667391
## 313 314 315 316 317 318 319 320
## 0.8317269 0.7764279 0.8261666 0.7173269 0.7553443 0.5877721 0.7764279 0.8755405
## 321 322 323 324 325 326 327 328
## 0.8825439 0.7850204 0.9025051 0.7405615 0.8644579 0.8193041 0.8712038 0.7695468
## 329 330 331 332 333 334 335 336
## 0.8086173 0.8317269 0.8989996 0.8004782 0.7897472 0.7777865 0.7173269 0.7553443
## 337 338 339 340 341 342 343 344
## 0.7831615 0.7480248 0.8621440 0.8204624 0.8876482 0.8621440 0.7831615 0.7480248
## 345 346 347 348 349 350 351 352
## 0.8024749 0.8024749 0.8755405 0.8612560 0.8146129 0.7709348 0.8086173 0.8989996
## 353 354 355 356 357 358 359 360
## 0.6817502 0.7961850 0.7480248 0.6532142 0.8175559 0.8667391 0.8424203 0.5849186
## 361 362 363 364 365 366 367 368
## 0.7480248 0.8916518 0.8475566 0.6496523 0.8851606 0.6953723 0.6672878 0.7764279
## 369 370 371 372 373 374 375 376
## 0.7329562 0.8738211 0.7011559 0.7897472 0.8507712 0.8317269 0.7173269 0.8204624
## 377 378 379 380 381 382 383 384
## 0.6672878 0.8141387 0.7314182 0.9052309 0.7283255 0.8204624 0.7645719 0.7831615
## 385 386 387 388 389 390 391
## 0.8216148 0.8371443 0.7936276 0.8712038 0.8169702 0.8371443 0.8574165
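The warning above means the output lists fitted probabilities for all 391 students rather than for the two requested values. The sketch below shows one way to get the two predictions, under stated assumptions. The first part applies the reported coefficients (-1.10800 and 0.03922) directly. The second part uses a simulated stand-in data frame named `toy` (hypothetical, not the case data) to show that a formula written with bare column names lets `predict()` use `newdata`.

```r
# Part 1: probabilities from the coefficients reported in the summary above
b0  <- -1.10800
b1  <- 0.03922
ssc <- c(60, 80)
manual_probs <- plogis(b0 + b1 * ssc)  # inverse logit
print(round(manual_probs, 4))

# Part 2: refit pattern on simulated stand-in data (NOT the case data)
set.seed(1)
toy <- data.frame(Percent_SSC = runif(200, min = 40, max = 95))
placed <- runif(200) < plogis(b0 + b1 * toy$Percent_SSC)
toy$Placement <- factor(ifelse(placed, "Placed", "Not Placed"),
                        levels = c("Not Placed", "Placed"))

# Bare column names in the formula, with the data frame passed via `data`
fit <- glm(Placement ~ Percent_SSC, data = toy, family = binomial)

# The column name in new_data must match the name used in the formula
new_data  <- data.frame(Percent_SSC = ssc)
toy_probs <- predict(fit, newdata = new_data, type = "response")
print(length(toy_probs))  # one prediction per row of new_data
```

With the reported coefficients, the placement probabilities come out at roughly 0.78 for a Percent_SSC of 60 and 0.88 for 80, close to fitted values that appear in the output above (for example 0.7764279 and 0.8838386).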
The default classification threshold is 0.5, but a better cut-off balances false positives against false negatives.
We use the ROC curve and Youden’s Index (sensitivity + specificity - 1) to find the optimal cut-off.
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Predicting the probabilities on the dataset
mbadata$predicted_probs <- predict(logistic_model, mbadata, type = "response")
# Computing the ROC curve
roc_curve <- roc(mbadata$Placement, mbadata$predicted_probs)
## Setting levels: control = Not Placed, case = Placed
## Setting direction: controls < cases
# Finding the optimal cut-off (Youden's Index)
optimal_cutoff <- coords(roc_curve, "best", ret = "threshold")
optimal_cutoff
## threshold
## 1 0.7574337
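To show what the Youden cut-off search does, here is a minimal base R sketch that scans candidate thresholds and keeps the one maximizing sensitivity + specificity - 1. The function `youden_cutoff` and the toy vectors are hypothetical and for illustration only. The sketch uses the observed probabilities themselves as candidate thresholds, so it is not a reimplementation of pROC and its result on real data may differ slightly from `coords()`.

```r
# Scan candidate thresholds and return the one with the largest Youden's J
youden_cutoff <- function(probs, is_positive) {
  thresholds <- sort(unique(probs))
  j <- sapply(thresholds, function(thr) {
    predicted_positive <- probs >= thr
    sensitivity <- sum(predicted_positive & is_positive) / sum(is_positive)
    specificity <- sum(!predicted_positive & !is_positive) / sum(!is_positive)
    sensitivity + specificity - 1
  })
  best <- which.max(j)  # first maximum if there are ties
  list(threshold = thresholds[best], youden_j = j[best])
}

# Toy example: predicted placement probabilities and true outcomes
toy_probs  <- c(0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9)
toy_placed <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)

result <- youden_cutoff(toy_probs, toy_placed)
print(result)
```

In the toy data, a threshold of 0.6 classifies all three non-placed students correctly and four of the five placed students, giving J = 0.8.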
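Check after the run:
The full model below is intended to retain variables with p < 0.10. In the fitted output, only Percent_SSC, Percentile_ET, and Marks_Communication meet that threshold, and the AIC (386.76) is essentially the same as for the Percent_SSC-only model (386.31), so the gain over Percent_SSC alone is small. Model evaluation metrics such as AIC, the ROC curve, and a confusion matrix should be checked.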
## [1] "RegNo" "Gender" "Gender-B"
## [4] "Percent_SSC" "Board_SSC" "Board_CBSE"
## [7] "Board_ICSE" "Percent_HSC" "Board_HSC"
## [10] "Stream_HSC" "Percent_Degree" "Course_Degree"
## [13] "Degree_Engg" "Experience_Yrs" "Entrance_Test"
## [16] "S-TEST" "Percentile_ET" "Percent_MBA"
## [19] "S-TEST*SCORE" "Specialization_MBA" "Marks_Communication"
## [22] "Marks_Projectwork" "Marks_BOCA" "Placement"
## [25] "Placement_B" "Salary" "predicted_probs"
# Full logistic regression model with the candidate predictors
# (the p-values in the summary show which of them meet p < 0.10)
logistic_model_full <- glm(Placement ~ Percent_SSC + Percent_HSC + Percent_Degree +
                             Percentile_ET + Experience_Yrs + Marks_Communication,
                           data = mbadata, family = binomial)
# Model summary
summary(logistic_model_full)
##
## Call:
## glm(formula = Placement ~ Percent_SSC + Percent_HSC + Percent_Degree +
## Percentile_ET + Experience_Yrs + Marks_Communication, family = binomial,
## data = mbadata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.011064 1.129659 -0.010 0.99219
## Percent_SSC 0.048633 0.014961 3.251 0.00115 **
## Percent_HSC -0.001807 0.013047 -0.139 0.88983
## Percent_Degree -0.004098 0.017239 -0.238 0.81210
## Percentile_ET 0.009300 0.004214 2.207 0.02731 *
## Experience_Yrs 0.317577 0.208915 1.520 0.12848
## Marks_Communication -0.032199 0.017434 -1.847 0.06477 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 393.52 on 390 degrees of freedom
## Residual deviance: 372.76 on 384 degrees of freedom
## AIC: 386.76
##
## Number of Fisher Scoring iterations: 4
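Because the Percent_SSC-only model is nested in this full model and both were fit to the same 391 rows, the two can be compared with a likelihood ratio test. The sketch below does the arithmetic from the deviances and AIC values reported in the two summaries; the variable names are illustrative.

```r
# Residual deviances and degrees of freedom copied from the two summaries
dev_ssc_only <- 382.31
df_ssc_only  <- 389
dev_full     <- 372.76
df_full      <- 384

# Likelihood ratio statistic: drop in deviance from adding five predictors
lr_stat <- dev_ssc_only - dev_full
lr_df   <- df_ssc_only - df_full
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# AIC difference (positive means the full model has the higher AIC)
aic_diff <- 386.76 - 386.31

cat(sprintf("LR statistic = %.2f on %d df, p = %.3f\n", lr_stat, lr_df, p_value))
cat(sprintf("AIC(full) - AIC(SSC only) = %.2f\n", aic_diff))
```

The deviance drops by 9.55 on 5 degrees of freedom, which falls short of significance at the 5% level, and the full model's AIC is slightly higher. On these figures, the extra predictors add little beyond Percent_SSC. With both fitted objects in memory, `anova(logistic_model, logistic_model_full, test = "Chisq")` should give the same comparison.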
The cost of admitting a non-placeable student (a false positive) is taken to be 4 times the cost of rejecting a placeable student (a false negative). The cut-off threshold is adjusted accordingly.
# Adjusting the cut-off using cost ratio (False Positive Cost / False Negative Cost = 4)
cost_sensitive_cutoff <- optimal_cutoff / (1 + (1/4))
cost_sensitive_cutoff
## threshold
## 1 0.605947
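For comparison, a common decision-theoretic way to turn a cost ratio into a cut-off is to admit a candidate only when the expected cost of admitting is lower than the expected cost of rejecting. This is a sketch of that standard rule, not the scaling used in the code above. It assumes the predicted probabilities are reasonably calibrated and that the costs are 4 (false positive) and 1 (false negative). All names are illustrative.

```r
cost_fp <- 4  # admitting a student who is then not placed
cost_fn <- 1  # rejecting a student who would have been placed

# Admit when p * cost_fn > (1 - p) * cost_fp, i.e. p > cost_fp / (cost_fp + cost_fn)
cost_threshold <- cost_fp / (cost_fp + cost_fn)

expected_cost_admit  <- function(p) (1 - p) * cost_fp
expected_cost_reject <- function(p) p * cost_fn

# Check the rule on a few predicted placement probabilities
p_grid   <- c(0.60, 0.75, 0.85, 0.90)
decision <- ifelse(expected_cost_admit(p_grid) < expected_cost_reject(p_grid),
                   "admit", "reject")

print(cost_threshold)
print(data.frame(p = p_grid, decision = decision))
```

Under this rule the cut-off moves up to 0.8 rather than down, because a costlier false positive calls for stronger evidence before admitting.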
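Summary of Insights -
Percent_SSC alone is a statistically significant but weak predictor of placement. In the full model, Percent_SSC (p = 0.001) and Percentile_ET (p = 0.027) are the significant predictors; Marks_Communication is marginal (p = 0.065) with a negative coefficient; Percent_HSC, Percent_Degree, and Experience_Yrs are not significant. The cost adjustment used here lowers the cut-off from 0.757 to 0.606. Note that a lower cut-off admits more candidates, so if the costlier error is admitting a non-placeable student, a higher cut-off would be the more conservative choice.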
Conclusion -
The analysis provides valuable insights into student placement and salary prediction through the application of both statistical and machine learning models. While the Random Forest model was effective in capturing complex relationships among the variables, the linear model also provided a solid baseline for understanding key predictors, offering interpretability and insight into the relative importance of factors such as Percent_SSC, Percent_HSC, and Percentile_ET. The linear model’s coefficients allowed for a clear understanding of how each feature influences placement and salary outcomes, making it a useful tool for educational institutions. Moving forward, future work should focus on refining feature engineering, incorporating additional data sources, and exploring more advanced modelling techniques, including deep learning and ensemble methods, to further enhance predictive accuracy. These findings provide actionable insights that can help educational institutions improve career counselling, adapt curricula to job market demands, and ensure better alignment between student preparation and industry requirements.