Introduction

The purpose of this data analysis project is to model and predict the chance of admission to a Master's graduate program, and to assess how much each application variable contributes to that chance.

The dataset contains several variables that are considered important when applying to Master's programs. The variables are:

- GRE Scores (out of 340)
- TOEFL Scores (out of 120)
- University Rating (out of 5)
- Statement of Purpose Strength (out of 5)
- Letter of Recommendation Strength (out of 5)
- Undergraduate GPA (out of 10)
- Research Experience (either 0 or 1)
- Chance of Admit (ranging from 0 to 1)

Dataset: https://www.kaggle.com/mohansacharya/graduate-admissions/home

Packages loaded: GGally, ggplot2, dplyr, tidyverse 1.3.0, caret (with lattice), and glmnet 3.0-1 (with Matrix).

Summary Statistics

This section reports the mean, minimum, and maximum of each variable; standard deviations do not appear in the output below.

##   MeanGRE MeanTOEFL MeanRating MeanSOP MeanLOR MeanCGPA MeanResearch MeanChance
## 1 316.472   107.192      3.114   3.374   3.484  8.57644         0.56    0.72174
##   MinGRE MinTOEFL MinRating MinSOP MinLOR MinCGPA MinResearch MinChance
## 1    290       92         1      1      1     6.8           0      0.34
##   MaxGRE MaxTOEFL MaxRating MaxSOP MaxLOR MaxCGPA MaxResearch MaxChance
## 1    340      120         5      5      5    9.92           1      0.97
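
A table like the one above can be produced with dplyr's `summarise()`. The following is a minimal sketch, not the code used for this report: `admissions_demo` and its values are made up for illustration, and only two of the eight variables are shown. It also adds standard deviations with `sd()`.

```r
library(dplyr)

# Made-up illustrative values, NOT rows from the admissions dataset.
admissions_demo <- data.frame(
  GRE.Score = c(300, 310, 320, 330, 340),
  CGPA      = c(7.5, 8.0, 8.5, 9.0, 9.5)
)

# A single summarise() call; the column names mirror the MeanGRE / MinGRE /
# MaxGRE pattern of the output above, with Sd* columns added.
summary_demo <- admissions_demo %>%
  summarise(
    MeanGRE  = mean(GRE.Score), MinGRE  = min(GRE.Score),
    MaxGRE   = max(GRE.Score),  SdGRE   = sd(GRE.Score),
    MeanCGPA = mean(CGPA),      MinCGPA = min(CGPA),
    MaxCGPA  = max(CGPA),       SdCGPA  = sd(CGPA)
  )
print(summary_demo)
```

Writing each statistic out explicitly keeps the column names readable, at the cost of some repetition.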

Distribution Analysis

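This section examines the spread and distribution of each variable. I check whether each variable is approximately normally distributed, as a first look at whether the assumptions of linear regression may be violated. Strictly speaking, the normality assumption applies to the model residuals rather than to the variables themselves, so these histograms are only a rough screen.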

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
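
The `stat_bin()` message appears when `geom_histogram()` is called without a bin specification. The sketch below shows one way to set `binwidth` explicitly and to check normality on the residuals of a fitted model. It runs on synthetic data: `demo`, the bin width of 0.2, and the one-predictor model are assumptions made for illustration, not the author's code.

```r
library(ggplot2)

# Synthetic stand-in data, NOT the admissions dataset.
set.seed(42)
demo <- data.frame(CGPA = round(rnorm(200, mean = 8.6, sd = 0.6), 2))
demo$Chance.of.Admit <- 0.72 + 0.12 * (demo$CGPA - 8.6) + rnorm(200, sd = 0.06)

# Setting binwidth explicitly avoids the `stat_bin()` message.
cgpa_hist <- ggplot(demo, aes(x = CGPA)) +
  geom_histogram(binwidth = 0.2, colour = "white") +
  labs(x = "CGPA", y = "Count")
print(cgpa_hist)

# The normality assumption concerns the residuals, so test those.
fit <- lm(Chance.of.Admit ~ CGPA, data = demo)
res_check <- shapiro.test(residuals(fit))
print(res_check$p.value)

# Q-Q plot of the residuals as a visual check.
qqnorm(residuals(fit))
qqline(residuals(fit))
```

A small Shapiro-Wilk p-value suggests non-normal residuals, although with several hundred observations the test can flag departures too small to matter in practice, so the Q-Q plot is worth reading alongside it.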

Linear Regression

## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + SOP + 
##     University.Rating + LOR + CGPA + Research)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.266657 -0.023327  0.009191  0.033714  0.156818 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.2757251  0.1042962 -12.232  < 2e-16 ***
## GRE.Score          0.0018585  0.0005023   3.700 0.000240 ***
## TOEFL.Score        0.0027780  0.0008724   3.184 0.001544 ** 
## SOP                0.0015861  0.0045627   0.348 0.728263    
## University.Rating  0.0059414  0.0038019   1.563 0.118753    
## LOR                0.0168587  0.0041379   4.074 5.38e-05 ***
## CGPA               0.1183851  0.0097051  12.198  < 2e-16 ***
## Research           0.0243075  0.0066057   3.680 0.000259 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05999 on 492 degrees of freedom
## Multiple R-squared:  0.8219, Adjusted R-squared:  0.8194 
## F-statistic: 324.4 on 7 and 492 DF,  p-value: < 2.2e-16
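
The coefficients can be read as a prediction formula: the predicted chance is the intercept plus each coefficient times the corresponding value. As a consistency check, the sketch below plugs the variable means from the Summary Statistics output into that formula. A least-squares fit with an intercept passes through the point of means, and 492 residual degrees of freedom plus 8 coefficients implies 500 observations, so the result should match the mean Chance of Admit (0.72174) if both outputs come from the same 500 rows. The names `coefs`, `x_mean`, and `pred_at_means` are illustrative.

```r
# Coefficient estimates copied from the lm() summary above.
coefs <- c(
  "(Intercept)"     = -1.2757251,
  GRE.Score         =  0.0018585,
  TOEFL.Score       =  0.0027780,
  SOP               =  0.0015861,
  University.Rating =  0.0059414,
  LOR               =  0.0168587,
  CGPA              =  0.1183851,
  Research          =  0.0243075
)

# Variable means copied from the Summary Statistics output;
# the leading 1 multiplies the intercept.
x_mean <- c(
  "(Intercept)"     = 1,
  GRE.Score         = 316.472,
  TOEFL.Score       = 107.192,
  SOP               = 3.374,
  University.Rating = 3.114,
  LOR               = 3.484,
  CGPA              = 8.57644,
  Research          = 0.56
)

# Linear predictor: sum of coefficient * value, matched by name.
pred_at_means <- sum(coefs * x_mean[names(coefs)])
print(round(pred_at_means, 4))  # 0.7217
```

Read the same way, each additional CGPA point is associated with an increase of about 0.118 in the predicted chance of admission, holding the other variables fixed. SOP and University.Rating are not statistically significant at the 0.05 level in this model.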

Lasso Regression
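
For reference, the lasso fits the same linear model with an L1 penalty on the coefficients. The objective below is the form documented for glmnet's Gaussian family with `alpha = 1`. The symbols ($N$ observations, $p$ predictors, penalty weight $\lambda$) are generic notation rather than names from this analysis, and the value of $\lambda$ used here is not stated in the report.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta}\;
  \frac{1}{2N}\sum_{i=1}^{N}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The L1 term can shrink coefficients to exactly zero. glmnet prints such coefficients as `.`, as it does for University.Rating and SOP in the output of this section.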

## 8 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## (Intercept)       -1.2168372457
## GRE.Score          0.0020934508
## TOEFL.Score        0.0009631816
## University.Rating  .           
## SOP                .           
## LOR                0.0079328697
## CGPA               0.1327655573
## Research           0.0094520469
##         RMSE  Rsquared
## 1 0.06668545 0.7629921
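
Output of this shape (a one-column `s0` coefficient matrix followed by RMSE and R-squared) is what a glmnet fit at a single lambda produces when evaluated on held-out data. The sketch below is one way to get there, under stated assumptions: the data are synthetic, the 80/20 split and cross-validated `lambda.min` are guesses at a typical workflow, and the metrics are computed in base R rather than with caret. It is not the author's code.

```r
library(glmnet)

# Synthetic stand-in data, NOT the admissions dataset; the column names
# follow the coefficient table above.
set.seed(1)
n <- 300
demo <- data.frame(
  GRE.Score         = round(rnorm(n, 316, 11)),
  TOEFL.Score       = round(rnorm(n, 107, 6)),
  University.Rating = sample(1:5, n, replace = TRUE),
  SOP               = sample(seq(1, 5, by = 0.5), n, replace = TRUE),
  LOR               = sample(seq(1, 5, by = 0.5), n, replace = TRUE),
  CGPA              = round(rnorm(n, 8.6, 0.6), 2),
  Research          = rbinom(n, 1, 0.56)
)
demo$Chance.of.Admit <- 0.72 + 0.002 * (demo$GRE.Score - 316) +
  0.12 * (demo$CGPA - 8.6) + 0.02 * demo$Research + rnorm(n, sd = 0.06)

# Hold out 20% of rows for testing (the split ratio is an assumption).
test_idx <- sample(n, size = 0.2 * n)
x <- as.matrix(demo[, setdiff(names(demo), "Chance.of.Admit")])
y <- demo$Chance.of.Admit

# Choose lambda by cross-validation, then refit at that single lambda.
cv_fit <- cv.glmnet(x[-test_idx, ], y[-test_idx], alpha = 1)
lasso_fit <- glmnet(x[-test_idx, ], y[-test_idx],
                    alpha = 1, lambda = cv_fit$lambda.min)
print(coef(lasso_fit))

# Test-set RMSE and squared correlation between predictions and outcomes.
pred <- as.numeric(predict(lasso_fit, newx = x[test_idx, ]))
metrics <- data.frame(
  RMSE     = sqrt(mean((pred - y[test_idx])^2)),
  Rsquared = cor(pred, y[test_idx])^2
)
print(metrics)
```

If the reported RMSE and R-squared were computed on a held-out test set, they are not directly comparable with the in-sample R-squared of 0.8219 from the linear regression summary.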

Conclusion

Citation: Mohan S. Acharya, Asfia Armaan, and Aneeta S. Antony, "A Comparison of Regression Models for Prediction of Graduate Admissions," IEEE International Conference on Computational Intelligence in Data Science, 2019.