Supervised learning- graduate admissions

2. Exploratory Data Analysis

Viewing the Dataset

We can inspect the variables in the dataset with the head(), str() and summary() functions.

head(data)

##   SerialNo. GREScore TOEFLScore UniversityRating SOP LOR CGPA Research
## 1         1      337        118                4 4.5 4.5 9.65        1
## 2         2      324        107                4 4.0 4.5 8.87        1
## 3         3      316        104                3 3.0 3.5 8.00        1
## 4         4      322        110                3 3.5 2.5 8.67        1
## 5         5      314        103                2 2.0 3.0 8.21        0
## 6         6      330        115                5 4.5 3.0 9.34        1
##   ChanceofAdmit
## 1          0.92
## 2          0.76
## 3          0.72
## 4          0.80
## 5          0.65
## 6          0.90

str(data)

## 'data.frame':    500 obs. of  9 variables:
##  $ SerialNo.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GREScore        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFLScore      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ UniversityRating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP             : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR             : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA            : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research        : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ ChanceofAdmit   : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

summary(data)

##    SerialNo.        GREScore       TOEFLScore    UniversityRating
##  Min.   :  1.0   Min.   :290.0   Min.   : 92.0   Min.   :1.000   
##  1st Qu.:125.8   1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000   
##  Median :250.5   Median :317.0   Median :107.0   Median :3.000   
##  Mean   :250.5   Mean   :316.5   Mean   :107.2   Mean   :3.114   
##  3rd Qu.:375.2   3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000   
##  Max.   :500.0   Max.   :340.0   Max.   :120.0   Max.   :5.000   
##       SOP             LOR             CGPA          Research   
##  Min.   :1.000   Min.   :1.000   Min.   :6.800   Min.   :0.00  
##  1st Qu.:2.500   1st Qu.:3.000   1st Qu.:8.127   1st Qu.:0.00  
##  Median :3.500   Median :3.500   Median :8.560   Median :1.00  
##  Mean   :3.374   Mean   :3.484   Mean   :8.576   Mean   :0.56  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:9.040   3rd Qu.:1.00  
##  Max.   :5.000   Max.   :5.000   Max.   :9.920   Max.   :1.00  
##  ChanceofAdmit   
##  Min.   :0.3400  
##  1st Qu.:0.6300  
##  Median :0.7200  
##  Mean   :0.7217  
##  3rd Qu.:0.8200  
##  Max.   :0.9700

We can see that all variables are numeric. Research is a binary indicator coded as 1 or 0.

colSums(is.na(data))

The check shows that no column in our dataset has missing values, so the dataset is complete and ready for further analysis.
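For contrast, here is a minimal sketch of what the same check reports when values are missing. The `toy` data frame and its values are hypothetical and are not part of the admissions dataset.

```r
# Hypothetical toy data with two missing values (not the admissions data)
toy <- data.frame(
  GREScore = c(320, NA, 310),
  CGPA     = c(8.5, 9.1, NA),
  Research = c(1, 0, 1)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when no cell in the data frame is NA
all_complete <- !anyNA(toy)

# Keep only the rows without any missing value
toy_complete <- toy[complete.cases(toy), ]
```

A column with a non-zero count would need imputation or row removal before modelling. In our dataset every count is zero, so neither step is required.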



Distributions


library(ggplot2)
ggplot(data = data, aes(x = ChanceofAdmit)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
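The message above suggests choosing the bin width explicitly. The following is a sketch of one way to do that. It assumes a bin width of 0.05, which is an illustrative choice, and uses a synthetic `toy` data frame in place of the real data.

```r
# Synthetic stand-in for data$ChanceofAdmit (values are illustrative only)
set.seed(1)
toy <- data.frame(ChanceofAdmit = round(runif(200, min = 0.34, max = 0.97), 2))

# Base R: compute the counts for bins of width 0.05 without drawing
breaks <- seq(0.30, 1.00, by = 0.05)
h <- hist(toy$ChanceofAdmit, breaks = breaks, plot = FALSE)

# ggplot2 equivalent; an explicit binwidth replaces the default of 30 bins
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = ChanceofAdmit)) +
    geom_histogram(binwidth = 0.05, boundary = 0.30)
}
```

A width of 0.05 gives 14 bins over the range 0.30 to 1.00. A narrower width shows more detail but makes the bars noisier.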

Let’s look at the normal quantile (Q-Q) plot for ChanceofAdmit in our dataset.

qqnorm(data$ChanceofAdmit)
qqline(data$ChanceofAdmit, col = 2)

Relationships

# Scatterplot matrix of all columns except column 7 (CGPA)
pairs(data[-7])

cor.test(~ CGPA + ChanceofAdmit, data = data)

## 
##  Pearson's product-moment correlation
## 
## data:  CGPA and ChanceofAdmit
## t = 41.855, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8613745 0.9004286
## sample estimates:
##       cor 
## 0.8824126

cor.test(~ TOEFLScore + ChanceofAdmit, data = data)

## 
##  Pearson's product-moment correlation
## 
## data:  TOEFLScore and ChanceofAdmit
## t = 28.972, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7571359 0.8227603
## sample estimates:
##       cor 
## 0.7922276

cor.test(~ CGPA + GREScore, data = data)

## 
##  Pearson's product-moment correlation
## 
## data:  CGPA and GREScore
## t = 32.686, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7958222 0.8518743
## sample estimates:
##      cor 
## 0.825878

cor.test(~ GREScore + ChanceofAdmit, data = data)

## 
##  Pearson's product-moment correlation
## 
## data:  GREScore and ChanceofAdmit
## t = 30.862, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7779406 0.8384601
## sample estimates:
##       cor 
## 0.8103506

cor.test(~ CGPA + TOEFLScore, data = data)

## 
##  Pearson's product-moment correlation
## 
## data:  CGPA and TOEFLScore
## t = 30.887, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7781969 0.8386529
## sample estimates:
##       cor 
## 0.8105735

cor.test(~ TOEFLScore + GREScore, data = data)

## 
##  Pearson's product-moment correlation
## 
## data:  TOEFLScore and GREScore
## t = 32.852, df = 498, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7973476 0.8530152
## sample estimates:
##       cor 
## 0.8272004
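The six pairwise tests above can also be collected into a single matrix. The sketch below shows one way to do this. It runs on a synthetic `toy` data frame whose values are made up for illustration, so the numbers it prints will not match the results reported above.

```r
# Synthetic stand-in for the four columns tested above (illustrative values)
set.seed(42)
n <- 500
cgpa <- rnorm(n, mean = 8.6, sd = 0.6)
toy <- data.frame(
  CGPA          = cgpa,
  GREScore      = 316 + 15 * (cgpa - 8.6) + rnorm(n, sd = 6),
  TOEFLScore    = 107 + 8 * (cgpa - 8.6) + rnorm(n, sd = 4),
  ChanceofAdmit = 0.72 + 0.2 * (cgpa - 8.6) + rnorm(n, sd = 0.06)
)

# All pairwise Pearson correlations in one call
cor_mat <- cor(toy, method = "pearson")
print(round(cor_mat, 2))

# p-value for each pair, using the same test as cor.test()
vars <- names(toy)
p_mat <- sapply(vars, function(a) sapply(vars, function(b) {
  if (a == b) NA else cor.test(toy[[a]], toy[[b]])$p.value
}))
```

With the real data, `cor(data[, c("CGPA", "GREScore", "TOEFLScore", "ChanceofAdmit")])` should reproduce the six `cor` estimates above in one table.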

ggplot( data=data, aes(x= ChanceofAdmit, y=CGPA))+ geom_point()


Here we can see the relationship between CGPA and the chance of admission. The following observations can be made from this plot:
1. The higher a student’s CGPA, the higher the chance of admission to the desired university.
2. Some students with a high CGPA still have a low chance of admission, because the outcome also depends on other parameters such as GRE score, TOEFL score and university rating.

To gain further insight through visualization, we use a correlation matrix and scatter plots.

Visualizations through Correlation:

## [1] "SerialNo."        "GREScore"         "TOEFLScore"       "UniversityRating"
## [5] "SOP"              "LOR"              "CGPA"             "Research"        
## [9] "ChanceofAdmit"

## Warning: Use of `data$ChanceofAdmit` is discouraged. Use `ChanceofAdmit`
## instead.

## Warning: Use of `data$TOEFLScore` is discouraged. Use `TOEFLScore` instead.

As the plot shows, a higher TOEFL score usually goes with a higher chance of admission.
However, the TOEFL score is not the only parameter that influences the chance of admission, so some students with a high TOEFL score still have a low chance because of other factors such as GRE score and CGPA.

Study correlation among numeric columns

## Warning: package 'corrplot' was built under R version 4.1.3

## corrplot 0.92 loaded

Visualize correlation

This is the correlation plot for all the parameters used in this analysis. The following observations can be made from the plot:
1. The darker the blue of a dot, the stronger the positive correlation between the two parameters.
2. The dot for GRE score and CGPA is dark blue, so an applicant with a higher GRE score is very likely to have a higher CGPA as well.
3. The dot for CGPA and chance of admission is also dark blue, which indicates that an applicant with a higher CGPA tends to have a higher chance of admission.

Scatter plot of research vs chance of admission

## Warning: Use of `data$ChanceofAdmit` is discouraged. Use `ChanceofAdmit`
## instead.

## Warning: Use of `data$GREScore` is discouraged. Use `GREScore` instead.

## Warning: Use of `data$UniversityRating` is discouraged. Use `UniversityRating`
## instead.

This scatter plot depicts the relationship between GRE score and chance of admission. The following observations can be made from this plot:
1. A further factor, university rating, can be seen affecting the chance of admission.
2. A higher GRE score does not by itself mean a higher chance of admission. It also depends on the university the student applies to: applying to a lower-rated university increases the chance of admission.
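The plotting code for this figure is not shown, but the warnings above indicate that columns were referenced as `data$...` inside `aes()`. The following sketch shows how such a plot could be written with bare column names, which avoids those warnings. The `toy` data frame, the axis assignment and the colour mapping are assumptions made for illustration.

```r
# Synthetic stand-in for the admissions data (illustrative values only)
set.seed(7)
toy <- data.frame(
  GREScore         = sample(290:340, 100, replace = TRUE),
  UniversityRating = sample(1:5, 100, replace = TRUE)
)
raw <- 0.34 + 0.012 * (toy$GREScore - 290) + rnorm(100, sd = 0.05)
toy$ChanceofAdmit <- pmin(pmax(raw, 0), 1)   # keep values inside [0, 1]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Bare column names inside aes(); the data argument supplies the columns
  p <- ggplot(toy, aes(x = GREScore, y = ChanceofAdmit,
                       colour = factor(UniversityRating))) +
    geom_point() +
    labs(colour = "University rating")
}
```

Wrapping `UniversityRating` in `factor()` gives one discrete colour per rating instead of a continuous gradient.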

Train the machine learning model

Linear regression models the relationship between a dependent variable and one or more explanatory variables by fitting a linear equation to the observed data. In our model the explanatory variables are GRE score, TOEFL score, university rating, SOP, LOR, CGPA and Research, and ChanceofAdmit is the dependent variable. The fitted coefficients of all explanatory variables are positive, i.e. the predicted chance of admission increases with a higher GRE score, TOEFL score or CGPA.

Moreover, a low mean squared error (MSE) indicates that the predicted values are close to the actual values, i.e. that the model fits well.

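library(caTools)
# Randomly assign about 70% of the rows to the training set.
# Call set.seed() beforehand to make the split reproducible.
is_train <- sample.split(data$GREScore, SplitRatio = 0.7)
train <- subset(data, is_train == TRUE)
test <- subset(data, is_train == FALSE)
# Fit a multiple linear regression on the training set
model <- lm(ChanceofAdmit ~ GREScore + TOEFLScore + UniversityRating + SOP + LOR + CGPA + Research, data = train)
summary(model)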

## 
## Call:
## lm(formula = ChanceofAdmit ~ GREScore + TOEFLScore + UniversityRating + 
##     SOP + LOR + CGPA + Research, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26889 -0.02366  0.00782  0.03238  0.15299 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.1800774  0.1200963  -9.826  < 2e-16 ***
## GREScore          0.0017699  0.0005852   3.025  0.00268 ** 
## TOEFLScore        0.0029435  0.0010322   2.852  0.00461 ** 
## UniversityRating  0.0059116  0.0044917   1.316  0.18902    
## SOP               0.0061649  0.0053616   1.150  0.25102    
## LOR               0.0153605  0.0048487   3.168  0.00167 ** 
## CGPA              0.1074420  0.0113393   9.475  < 2e-16 ***
## Research          0.0244045  0.0075856   3.217  0.00142 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05787 on 341 degrees of freedom
## Multiple R-squared:  0.8273, Adjusted R-squared:  0.8237 
## F-statistic: 233.3 on 7 and 341 DF,  p-value: < 2.2e-16
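The split that produced this fit is random, so rerunning the code gives slightly different coefficients each time. The following is a minimal sketch of a reproducible 70/30 split in base R. It is not the method used above: the seed value is arbitrary and the `toy` data frame is a placeholder for the real data.

```r
# Placeholder data with the same number of rows as the admissions data
toy <- data.frame(id = 1:500, y = seq(0.34, 0.97, length.out = 500))

set.seed(2022)                           # arbitrary seed, fixes the split
n_train   <- round(0.7 * nrow(toy))      # 350 of 500 rows
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Resetting the seed reproduces exactly the same row selection
set.seed(2022)
train_idx_again <- sample(seq_len(nrow(toy)), size = n_train)
```

Unlike `sample.split()`, this draws rows uniformly at random and does not try to preserve the distribution of any column between the two sets.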

predicted <- predict(model, test)
data2 <- data.frame(predicted, test$ChanceofAdmit)
head(data2)

##    predicted test.ChanceofAdmit
## 1  0.9454374               0.92
## 9  0.5554829               0.50
## 11 0.7348171               0.52
## 12 0.8342141               0.84
## 13 0.8496717               0.78
## 15 0.6519425               0.61
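To see how a prediction follows from the fitted coefficients, the first value above can be recomputed by hand. This sketch copies the coefficient estimates from the model summary and the values of applicant 1 from the `head(data)` output. Only the variable names `coefs`, `applicant` and `pred_by_hand` are introduced here.

```r
# Coefficient estimates copied from the summary(model) output
coefs <- c(
  "(Intercept)"    = -1.1800774,
  GREScore         =  0.0017699,
  TOEFLScore       =  0.0029435,
  UniversityRating =  0.0059116,
  SOP              =  0.0061649,
  LOR              =  0.0153605,
  CGPA             =  0.1074420,
  Research         =  0.0244045
)

# Applicant 1 from head(data); the leading 1 multiplies the intercept
applicant <- c(1, GREScore = 337, TOEFLScore = 118, UniversityRating = 4,
               SOP = 4.5, LOR = 4.5, CGPA = 9.65, Research = 1)

# Linear predictor: sum of coefficient times value
pred_by_hand <- sum(coefs * applicant)
print(round(pred_by_hand, 4))
```

The result is about 0.9454, which agrees with the first predicted value above up to the rounding of the printed coefficients. The CGPA term contributes the most: each additional CGPA point adds about 0.107 to the predicted chance, holding the other variables fixed.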

MSE <- mean((predicted - test$ChanceofAdmit)^2)
MSE

## [1] 0.004260862
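Because the MSE is in squared units, its square root (RMSE) is often easier to read. The sketch below wraps the MSE formula in a helper function `mse()`, which is a name introduced here for illustration. It applies the helper to the six test rows printed above and converts the reported test MSE to an RMSE.

```r
# Mean squared error: average of the squared differences
mse <- function(actual, predicted) mean((actual - predicted)^2)

# First six test rows printed above (predicted vs. actual)
predicted <- c(0.9454374, 0.5554829, 0.7348171, 0.8342141, 0.8496717, 0.6519425)
actual    <- c(0.92, 0.50, 0.52, 0.84, 0.78, 0.61)

mse_six <- mse(actual, predicted)   # MSE on these six rows only

# RMSE of the full test set, from the MSE reported above
rmse_test <- sqrt(0.004260862)
print(round(rmse_test, 4))
```

The RMSE is roughly 0.065, so a typical prediction is off by about 6.5 percentage points of admission chance. The six-row MSE is larger than the full test MSE because row 11 alone has an error of more than 0.2.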

m <-  model.matrix(~ChanceofAdmit + GREScore + TOEFLScore + UniversityRating + SOP + LOR + CGPA + Research, data = train)
f <- ChanceofAdmit ~ GREScore + TOEFLScore + UniversityRating + SOP + LOR + CGPA + Research

Using neural networks

Neural networks are typically used when large datasets are available and predictive accuracy is the main goal. Since our model predicts the chance of admission for students applying to different universities, and we want the highest possible accuracy, we also trained a neural network.
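The network code itself is not shown in this report. One preprocessing step that is commonly applied before training a neural network is min-max scaling, because inputs on very different scales (GRE scores near 300, CGPA near 9) can slow down or destabilise gradient-based training. The sketch below illustrates that step on a hypothetical `toy` data frame. Whether the analysis here scaled its inputs is an assumption, not something the output confirms.

```r
# Hypothetical toy predictors on very different scales
toy <- data.frame(
  GREScore = c(290, 310, 325, 340),
  CGPA     = c(6.8, 8.1, 8.9, 9.92)
)

# Min-max scaling to [0, 1], column by column
mins   <- sapply(toy, min)
maxs   <- sapply(toy, max)
scaled <- as.data.frame(scale(toy, center = mins, scale = maxs - mins))

# Invert the scaling for one column to recover the original units
gre_range <- maxs[["GREScore"]] - mins[["GREScore"]]
gre_back  <- scaled$GREScore * gre_range + mins[["GREScore"]]
```

In a train/test setting, the minimum and maximum would be computed on the training set only and then reused for the test set, so that no information from the test rows leaks into training.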

## 'data.frame':    151 obs. of  9 variables:
##  $ SerialNo.       : int  1 9 11 12 13 15 18 20 21 32 ...
##  $ GREScore        : int  337 302 325 327 328 311 319 303 312 327 ...
##  $ TOEFLScore      : int  118 102 106 111 112 104 106 102 107 103 ...
##  $ UniversityRating: int  4 1 3 4 4 3 3 3 3 3 ...
##  $ SOP             : num  4.5 2 3.5 4 4 3.5 4 3.5 3 4 ...
##  $ LOR             : num  4.5 1.5 4 4.5 4.5 2 3 3 2 4 ...
##  $ CGPA            : num  9.65 8 8.4 9 9.1 8.2 8 8.5 7.9 8.3 ...
##  $ Research        : int  1 0 1 1 1 1 1 0 1 1 ...
##  $ ChanceofAdmit   : num  0.92 0.5 0.52 0.84 0.78 0.61 0.65 0.62 0.64 0.74 ...

##    predict.nn.net.result ChanceofAdmit
## 1              0.9426275          0.92
## 9              0.5836722          0.50
## 11             0.7322430          0.52
## 12             0.8433497          0.84
## 13             0.8611204          0.78
## 15             0.6458030          0.61

## [1] 0.004621707

##           MSE      NN.MSE
## 1 0.004260862 0.004621707

Model Comparison and Conclusion

We base our selection on the mean squared error (MSE), a measure of how close the fitted values are to the data points. For every data point, you take the vertical distance from the point to the corresponding predicted value (the error) and square it; the MSE is the average of these squared errors.

From the above results, we can see that both models are able to predict the chance of admission. However, the linear regression has a lower MSE on the test set than the neural network (0.00426 vs 0.00462), so it is the better-suited model here.
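The two MSE values differ by about 8% and come from a single random split, so the ranking could change with a different split. One way to check its stability is k-fold cross-validation. The following is a sketch for the linear model only, on a synthetic `toy` data frame. The choice of five folds, the seed and the data are illustrative assumptions.

```r
# Synthetic stand-in data: a linear signal plus noise (illustrative only)
set.seed(123)
n   <- 200
toy <- data.frame(CGPA = runif(n, 6.8, 9.9), GREScore = runif(n, 290, 340))
toy$ChanceofAdmit <- 0.1 * toy$CGPA + 0.002 * toy$GREScore - 0.8 +
  rnorm(n, sd = 0.05)

k     <- 5
folds <- sample(rep(seq_len(k), length.out = n))   # random fold label per row

# Fit on k-1 folds, score on the held-out fold, repeat for each fold
fold_mse <- sapply(seq_len(k), function(i) {
  fit  <- lm(ChanceofAdmit ~ CGPA + GREScore, data = toy[folds != i, ])
  pred <- predict(fit, newdata = toy[folds == i, ])
  mean((toy$ChanceofAdmit[folds == i] - pred)^2)
})

cv_mse <- mean(fold_mse)   # average held-out MSE
cv_sd  <- sd(fold_mse)     # spread of the MSE across folds
```

Running the same loop for both models and comparing the average MSE against its spread across folds would show whether the difference between them is larger than the split-to-split variation.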


sriya, Sahil, Prakhar, Sai, Jayanth, Navnish

4/5/2022
