In this project we analyze trends in existing admissions data and predict a student's chance of admission from factors such as test scores, statement of purpose (SOP), letters of recommendation (LOR), and GPA by building supervised models.
The data set has 500 observations and 9 variables and can be found here
The variables include the scores students obtained in tests such as the GRE and TOEFL, along with other factors such as SOP, LOR, and research experience prior to the application.
We will use supervised learning methods, namely linear regression and a neural network, to predict the chance of admission.
Packages required
#install.packages("rpart.plot")
#install.packages("caret")
#install.packages("rattle")
#install.packages("ROCR")
#install.packages("randomForest")
#install.packages("reshape2")
#install.packages("neuralnet")
#install.packages("caTools")
library(rpart)
library(rpart.plot)
library(lattice)
library(caret)
library(rattle)
library(ROCR)
library(randomForest)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(reshape2)
library(neuralnet)
library(caTools)
Importing the Dataset
data <- read.csv('Admission_Predict_Ver1.1.csv')
Viewing the Dataset
We can inspect the variables in the dataset using the head(), str(), and summary() functions.
head(data)
## SerialNo. GREScore TOEFLScore UniversityRating SOP LOR CGPA Research
## 1 1 337 118 4 4.5 4.5 9.65 1
## 2 2 324 107 4 4.0 4.5 8.87 1
## 3 3 316 104 3 3.0 3.5 8.00 1
## 4 4 322 110 3 3.5 2.5 8.67 1
## 5 5 314 103 2 2.0 3.0 8.21 0
## 6 6 330 115 5 4.5 3.0 9.34 1
## ChanceofAdmit
## 1 0.92
## 2 0.76
## 3 0.72
## 4 0.80
## 5 0.65
## 6 0.90
str(data)
## 'data.frame': 500 obs. of 9 variables:
## $ SerialNo. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GREScore : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFLScore : int 118 107 104 110 103 115 109 101 102 108 ...
## $ UniversityRating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ ChanceofAdmit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
summary(data)
## SerialNo. GREScore TOEFLScore UniversityRating
## Min. : 1.0 Min. :290.0 Min. : 92.0 Min. :1.000
## 1st Qu.:125.8 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000
## Median :250.5 Median :317.0 Median :107.0 Median :3.000
## Mean :250.5 Mean :316.5 Mean :107.2 Mean :3.114
## 3rd Qu.:375.2 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000
## Max. :500.0 Max. :340.0 Max. :120.0 Max. :5.000
## SOP LOR CGPA Research
## Min. :1.000 Min. :1.000 Min. :6.800 Min. :0.00
## 1st Qu.:2.500 1st Qu.:3.000 1st Qu.:8.127 1st Qu.:0.00
## Median :3.500 Median :3.500 Median :8.560 Median :1.00
## Mean :3.374 Mean :3.484 Mean :8.576 Mean :0.56
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:1.00
## Max. :5.000 Max. :5.000 Max. :9.920 Max. :1.00
## ChanceofAdmit
## Min. :0.3400
## 1st Qu.:0.6300
## Median :0.7200
## Mean :0.7217
## 3rd Qu.:0.8200
## Max. :0.9700
We can see that all variables are numeric, and that Research is a binary variable coded as 1 or 0.
Checking for missing values
colSums(is.na(data))
The check shows that no column contains missing values, so the dataset is complete and ready for further analysis.
Distributions
ggplot(data = data, aes(x=ChanceofAdmit)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
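The message above asks for an explicit bin width. As a minimal sketch in base R rather than ggplot2, the snippet below bins the first ten ChanceofAdmit values shown in the str() output into intervals of width 0.1. The bin width and the use of hist() are illustrative choices, not the settings used in this report.

```r
# First ten ChanceofAdmit values, copied from the str() output above
chance <- c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)

# Explicit bin edges from 0.3 to 1.0 in steps of 0.1 (illustrative choice)
bin_edges <- seq(0.3, 1.0, by = 0.1)

# plot = FALSE returns the bin counts without drawing anything
h <- hist(chance, breaks = bin_edges, plot = FALSE)

# One row per bin: lower edge, upper edge, number of values in the bin
print(data.frame(lower = head(h$breaks, -1),
                 upper = tail(h$breaks, -1),
                 count = h$counts))
```

With ggplot2 the same idea is expressed by passing `binwidth` to `geom_histogram()`, as the message itself suggests.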
Let’s look at the normal quantile plot for ChanceofAdmit in our dataset.
qqnorm(data$ChanceofAdmit)
qqline(data$ChanceofAdmit, col = 2)
Relationships
pairs(data[-7])
cor.test(~ CGPA + ChanceofAdmit, data = data)
##
## Pearson's product-moment correlation
##
## data: CGPA and ChanceofAdmit
## t = 41.855, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8613745 0.9004286
## sample estimates:
## cor
## 0.8824126
cor.test(~ TOEFLScore + ChanceofAdmit, data = data)
##
## Pearson's product-moment correlation
##
## data: TOEFLScore and ChanceofAdmit
## t = 28.972, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7571359 0.8227603
## sample estimates:
## cor
## 0.7922276
cor.test(~ CGPA + GREScore, data = data)
##
## Pearson's product-moment correlation
##
## data: CGPA and GREScore
## t = 32.686, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7958222 0.8518743
## sample estimates:
## cor
## 0.825878
cor.test(~ GREScore + ChanceofAdmit, data = data)
##
## Pearson's product-moment correlation
##
## data: GREScore and ChanceofAdmit
## t = 30.862, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7779406 0.8384601
## sample estimates:
## cor
## 0.8103506
cor.test(~ CGPA + TOEFLScore, data = data)
##
## Pearson's product-moment correlation
##
## data: CGPA and TOEFLScore
## t = 30.887, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7781969 0.8386529
## sample estimates:
## cor
## 0.8105735
cor.test(~ TOEFLScore + GREScore, data = data)
##
## Pearson's product-moment correlation
##
## data: TOEFLScore and GREScore
## t = 32.852, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7973476 0.8530152
## sample estimates:
## cor
## 0.8272004
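The pairwise tests above each report one correlation. As a minimal sketch, all pairwise Pearson correlations can also be obtained in a single call to base R's cor(). The data frame below holds only the first ten rows shown in the str() output, so its values will differ from the full-data estimates above; `first10` and `cor_matrix` are names chosen for this illustration.

```r
# First ten rows of four columns, copied from the str() output above
first10 <- data.frame(
  GREScore      = c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323),
  TOEFLScore    = c(118, 107, 104, 110, 103, 115, 109, 101, 102, 108),
  CGPA          = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00, 8.60),
  ChanceofAdmit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)
)

# Pearson correlation for every pair of columns in one call
cor_matrix <- cor(first10, method = "pearson")

# The matrix is symmetric with ones on the diagonal
print(round(cor_matrix, 3))
```

Unlike cor.test(), cor() returns only the estimates, without confidence intervals or p-values.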
ggplot( data=data, aes(x= ChanceofAdmit, y=CGPA))+ geom_point()
Here we can see the dependency between CGPA and chance of admission. The following observations can be made from this plot:
1. The higher a student's CGPA, the higher the chance of admission to the desired university.
2. In some cases a student has a high CGPA but a low chance of admission, because the chance also depends on other parameters such as GRE score, TOEFL score, and university rating.
To gain further insight through visualization, we use a correlation matrix and scatter plots.
Visualizations through Correlation:
## [1] "SerialNo." "GREScore" "TOEFLScore" "UniversityRating"
## [5] "SOP" "LOR" "CGPA" "Research"
## [9] "ChanceofAdmit"
## Warning: Use of `data$ChanceofAdmit` is discouraged. Use `ChanceofAdmit`
## instead.
## Warning: Use of `data$TOEFLScore` is discouraged. Use `TOEFLScore` instead.
As the plot shows, a higher TOEFL score usually goes with a higher chance of admission. However, TOEFL score is not the only parameter that influences the chance of admission, which is why some students with a high TOEFL score still have a low chance: other factors such as GRE score and CGPA also play a part.
Study correlation among numeric columns
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
Visualize correlation
This is the correlation plot among all the parameters used in this analysis. The following observations can be made from the plot:
1. The darker blue a dot is, the stronger the positive correlation between the two parameters.
2. A student with a higher GRE score is very likely to have a higher CGPA as well, since that dot is dark blue.
3. Similarly, the dot for CGPA and chance of admission is dark blue, which implies that a student with a higher CGPA also tends to have a higher chance of admission.
Scatter plot of GRE score vs chance of admission
## Warning: Use of `data$ChanceofAdmit` is discouraged. Use `ChanceofAdmit`
## instead.
## Warning: Use of `data$GREScore` is discouraged. Use `GREScore` instead.
## Warning: Use of `data$UniversityRating` is discouraged. Use `UniversityRating`
## instead.
This scatter plot depicts the relationship between GRE score and chance of admission. The following observations can be made from this plot:
1. University rating is a further factor that can be seen to affect the chance of admission.
2. A higher GRE score does not by itself mean a higher chance of admission; it also depends on the university the student is applying to. The plot suggests that applying to a lower-rated university increases the chance of admission.
Train the machine learning model
Linear regression models the relationship between a dependent variable and one or more explanatory variables by fitting a linear equation to the observed data. In our model the explanatory variables are GRE score, TOEFL score, university rating, SOP, LOR, CGPA, and research experience, and ChanceofAdmit is the dependent variable. All fitted coefficients are positive, so the predicted chance of admission increases with a higher GRE score, TOEFL score, or CGPA.
Moreover, a low mean squared error (MSE) means that the predicted values are close to the actual values, which indicates a good fit.
library(caTools)
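# Randomly assign about 70% of the rows to the training set (no seed is set, so the split differs between runs)
sample <- sample.split(data$GREScore, SplitRatio = 0.7)
train <- subset(data, sample == TRUE)
test <- subset(data, sample == FALSE)
# Fit a multiple linear regression on the training set
model <- lm(ChanceofAdmit ~ GREScore + TOEFLScore + UniversityRating + SOP + LOR + CGPA + Research, data = train)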
summary(model)
##
## Call:
## lm(formula = ChanceofAdmit ~ GREScore + TOEFLScore + UniversityRating +
## SOP + LOR + CGPA + Research, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26889 -0.02366 0.00782 0.03238 0.15299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1800774 0.1200963 -9.826 < 2e-16 ***
## GREScore 0.0017699 0.0005852 3.025 0.00268 **
## TOEFLScore 0.0029435 0.0010322 2.852 0.00461 **
## UniversityRating 0.0059116 0.0044917 1.316 0.18902
## SOP 0.0061649 0.0053616 1.150 0.25102
## LOR 0.0153605 0.0048487 3.168 0.00167 **
## CGPA 0.1074420 0.0113393 9.475 < 2e-16 ***
## Research 0.0244045 0.0075856 3.217 0.00142 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05787 on 341 degrees of freedom
## Multiple R-squared: 0.8273, Adjusted R-squared: 0.8237
## F-statistic: 233.3 on 7 and 341 DF, p-value: < 2.2e-16
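The training set behind this fit comes from a random split, so the coefficients change slightly every time the report is re-run. A minimal sketch of a repeatable 70/30 split in base R follows. It uses set.seed() and sample() instead of caTools, works on a stand-in data frame (`toy`, a name invented for this example), and the seed value 42 is arbitrary; none of this is taken from the report's own code.

```r
# Stand-in for the 500-row dataset; only a row identifier is needed here
n_rows <- 500
toy <- data.frame(SerialNo = seq_len(n_rows))

# Fixing the seed makes the random split repeatable (the value 42 is arbitrary)
set.seed(42)
train_idx <- sample(seq_len(n_rows), size = round(0.7 * n_rows))

# Rows drawn go to the training set, all remaining rows to the test set
train_toy <- toy[train_idx, , drop = FALSE]
test_toy  <- toy[-train_idx, , drop = FALSE]

cat("training rows:", nrow(train_toy), " test rows:", nrow(test_toy), "\n")
```

Calling set.seed() before sample.split() would make the caTools split repeatable in the same way; the row counts from sample.split() can differ slightly from a plain 70/30 draw, as the 151-row test set in this report shows.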
predicted <- predict(model, test)
data2 <- data.frame(predicted, test$ChanceofAdmit)
head(data2)
## predicted test.ChanceofAdmit
## 1 0.9454374 0.92
## 9 0.5554829 0.50
## 11 0.7348171 0.52
## 12 0.8342141 0.84
## 13 0.8496717 0.78
## 15 0.6519425 0.61
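To see what predict() does for a linear model, the first prediction can be reproduced by hand. The sketch below multiplies the coefficients printed by summary(model) with the predictor values of observation 1 from head(data) and adds them up. Because the printed coefficients are rounded, the result should agree with the predict() output only to about four decimal places; `coefs`, `x1`, and `pred1` are names chosen for this illustration.

```r
# Coefficients as printed by summary(model) above
coefs <- c(
  "(Intercept)"    = -1.1800774,
  GREScore         =  0.0017699,
  TOEFLScore       =  0.0029435,
  UniversityRating =  0.0059116,
  SOP              =  0.0061649,
  LOR              =  0.0153605,
  CGPA             =  0.1074420,
  Research         =  0.0244045
)

# Predictor values for observation 1, taken from head(data);
# the leading 1 multiplies the intercept
x1 <- c(1, 337, 118, 4, 4.5, 4.5, 9.65, 1)

# A linear-model prediction is the sum of coefficient times predictor value
pred1 <- sum(coefs * x1)
print(round(pred1, 4))
```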
MSE <- mean((predicted - test$ChanceofAdmit)^2)
MSE
## [1] 0.004260862
m <- model.matrix(~ChanceofAdmit + GREScore + TOEFLScore + UniversityRating + SOP + LOR + CGPA + Research, data = train)
f <- ChanceofAdmit ~ GREScore + TOEFLScore + UniversityRating + SOP + LOR + CGPA + Research
By using neural networks
Neural networks are generally used when large datasets are available and predictive accuracy is the most important consideration. Since our model predicts the chance of admission for students applying to different universities, we used a neural network because we want the highest possible accuracy from our analysis.
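The code that fitted the network is not shown in this report. A common preparatory step for neural networks, assumed here purely for illustration and not confirmed by the source, is to rescale each predictor to the range 0 to 1 so that variables on large scales such as GREScore do not dominate training. A minimal base R sketch using the first ten GRE scores from the str() output follows; `min_max` and `undo_min_max` are hypothetical helper names.

```r
# Rescale a numeric vector to [0, 1]; lo and hi default to the vector's own range
min_max <- function(x, lo = min(x), hi = max(x)) (x - lo) / (hi - lo)

# Map a scaled value back to the original units
undo_min_max <- function(s, lo, hi) s * (hi - lo) + lo

# First ten GREScore values, copied from the str() output above
gre <- c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323)

gre_scaled <- min_max(gre)
gre_back   <- undo_min_max(gre_scaled, min(gre), max(gre))

print(round(gre_scaled, 3))
```

If such scaling is used, the minimum and maximum should come from the training set only and then be reused for the test set, so that both sets are transformed identically.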
## 'data.frame': 151 obs. of 9 variables:
## $ SerialNo. : int 1 9 11 12 13 15 18 20 21 32 ...
## $ GREScore : int 337 302 325 327 328 311 319 303 312 327 ...
## $ TOEFLScore : int 118 102 106 111 112 104 106 102 107 103 ...
## $ UniversityRating: int 4 1 3 4 4 3 3 3 3 3 ...
## $ SOP : num 4.5 2 3.5 4 4 3.5 4 3.5 3 4 ...
## $ LOR : num 4.5 1.5 4 4.5 4.5 2 3 3 2 4 ...
## $ CGPA : num 9.65 8 8.4 9 9.1 8.2 8 8.5 7.9 8.3 ...
## $ Research : int 1 0 1 1 1 1 1 0 1 1 ...
## $ ChanceofAdmit : num 0.92 0.5 0.52 0.84 0.78 0.61 0.65 0.62 0.64 0.74 ...
## predict.nn.net.result ChanceofAdmit
## 1 0.9426275 0.92
## 9 0.5836722 0.50
## 11 0.7322430 0.52
## 12 0.8433497 0.84
## 13 0.8611204 0.78
## 15 0.6458030 0.61
## [1] 0.004621707
## MSE NN.MSE
## 1 0.004260862 0.004621707
Model Comparison and Conclusion
We base our model selection on the mean squared error (MSE), a measure of how close the fitted values are to the data points. For every data point, take the vertical distance between the observed value and the corresponding fitted value (the error) and square it; the MSE is the average of these squared errors.
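The definition can be made concrete with a small worked example. The sketch below applies it to the six linear-regression predictions printed by head(data2) earlier; `mse`, `predicted6`, and `actual6` are names chosen for this illustration, and the result covers these six rows only, not the full test set.

```r
# Mean squared error: average of the squared differences
mse <- function(predicted, actual) mean((predicted - actual)^2)

# Six predicted and actual values, copied from the head(data2) output above
predicted6 <- c(0.9454374, 0.5554829, 0.7348171, 0.8342141, 0.8496717, 0.6519425)
actual6    <- c(0.92, 0.50, 0.52, 0.84, 0.78, 0.61)

# Step 1: squared error for each observation
squared_errors <- (predicted6 - actual6)^2
print(round(squared_errors, 5))

# Step 2: average of the squared errors
print(mse(predicted6, actual6))
```

The six-row value is larger than the test-set MSE reported above because the third of these rows (observation 11) has a much larger error than the others, and squaring gives large errors extra weight.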
From the results above, both models are able to predict the chance of admission. However, the linear regression has the lower test MSE (0.00426 versus 0.00462 for the neural network), so it is the better suited of the two for this data.
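To put the two errors on the same scale as ChanceofAdmit, the MSE values reported above can be converted to root mean squared errors (RMSE). This is a supplementary sketch, not part of the original analysis; `mse_values`, `rmse_values`, and `ratio` are names chosen for this illustration.

```r
# Test-set MSE values as reported in the comparison table above
mse_values <- c(linear_regression = 0.004260862, neural_network = 0.004621707)

# RMSE is the square root of the MSE and has the same units as ChanceofAdmit
rmse_values <- sqrt(mse_values)
print(round(rmse_values, 4))

# Ratio of the two MSE values: how much larger the network's error is
ratio <- mse_values[["neural_network"]] / mse_values[["linear_regression"]]
print(round(ratio, 3))
```

Both models are typically off by roughly 0.07 on the 0 to 1 scale, so the gap between them is small; a different random split could plausibly change the ranking.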