GRE Score prediction from CGPA

Simple Linear Regression

Sumukha Venkatesha Murthy (s3797866)
Anilkumar Lingaraj Biradar (s3798024)
Amogha Amaresh (s3789160)

Last updated: 27 October, 2019

Introduction

Problem Statement

Data

Data (cont..)

  1. Serial No. - This number represents the row number.
  2. GRE Score - The Graduate Record Examinations (GRE) score, out of 340.
  3. TOEFL Score - The TOEFL test is scored on a scale of 0 to 120 points.
  4. University Rating - This column has various rating levels of University ranging from 1 to 5.
  5. SOP - The rating of Statement Of Purpose on a scale of 1 to 5.
  6. LOR - The rating of Letter Of Recommendation on a scale of 1 to 5.
  7. CGPA - The Cumulative Grade Point Average obtained by an undergraduate student, out of 10.
  8. Research - This column represents whether the student has research experience (1) or no research experience (0).
  9. Chance of Admit - The student's chance of admission, ranging from 0 to 1.

Descriptive Statistics

library(readr)  # read_csv()
library(dplyr)  # %>% and summarise()
admit <- read_csv("Admission_Predict.csv")
# Descriptive statistics for CGPA
admit %>%
  summarise(Min = min(CGPA, na.rm = TRUE),
            Q1 = quantile(CGPA, probs = .25, na.rm = TRUE),
            Median = median(CGPA, na.rm = TRUE),
            Q3 = quantile(CGPA, probs = .75, na.rm = TRUE),
            Max = max(CGPA, na.rm = TRUE),
            Mean = mean(CGPA, na.rm = TRUE),
            SD = sd(CGPA, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(CGPA))) -> table1
knitr::kable(table1)
Min   Q1     Median   Q3       Max    Mean       SD          n     Missing
6.8   8.17   8.61     9.0625   9.92   8.598925   0.5963171   400   0
# Descriptive statistics for GRE Score
admit %>%
  summarise(Min = min(`GRE Score`, na.rm = TRUE),
            Q1 = quantile(`GRE Score`, probs = .25, na.rm = TRUE),
            Median = median(`GRE Score`, na.rm = TRUE),
            Q3 = quantile(`GRE Score`, probs = .75, na.rm = TRUE),
            Max = max(`GRE Score`, na.rm = TRUE),
            Mean = mean(`GRE Score`, na.rm = TRUE),
            SD = sd(`GRE Score`, na.rm = TRUE),
            n = n(),
            Missing = sum(is.na(`GRE Score`))) -> table2
knitr::kable(table2)
Min   Q1    Median   Q3    Max   Mean       SD         n     Missing
290   308   317      325   340   316.8075   11.47365   400   0

Data Issues

library(car)  # Boxplot() prints the row numbers of flagged outliers
par(mfrow=c(1,2))
Boxplot(admit$CGPA , ylab = "CGPA")
## [1] 59
Boxplot(admit$`GRE Score` , ylab = "GRE Score")
Figure 1

# Remove the CGPA outlier flagged by Boxplot() above (row 59)
admit <- admit[-59,]
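
The flagged row can be sanity-checked against the descriptive statistics. The sketch below assumes the usual 1.5 × IQR rule for boxplot whiskers; `tukey_fences` is a hypothetical helper, and the quartiles are taken from the tables above (R's boxplots use hinges, which can differ slightly from these quantiles).

```r
# Hypothetical helper: lower and upper outlier fences from the quartiles,
# assuming the conventional 1.5 * IQR rule.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles, minima and maxima as reported in the descriptive statistics
cgpa_fences <- tukey_fences(q1 = 8.17, q3 = 9.0625)
gre_fences  <- tukey_fences(q1 = 308,  q3 = 325)

# Is the smallest or largest observed value outside the fences?
cgpa_min_flagged <- unname(6.8  < cgpa_fences["lower"])
cgpa_max_flagged <- unname(9.92 > cgpa_fences["upper"])
gre_min_flagged  <- unname(290  < gre_fences["lower"])
gre_max_flagged  <- unname(340  > gre_fences["upper"])

print(cgpa_fences)
print(gre_fences)
print(c(cgpa_min = cgpa_min_flagged, cgpa_max = cgpa_max_flagged,
        gre_min = gre_min_flagged, gre_max = gre_max_flagged))
```

Under these assumptions only the minimum CGPA (6.8) falls outside its fences, which is consistent with a single flagged observation for CGPA and none for GRE Score.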

Visualisation

library(ggplot2)  # ggplot(), geom_point()
ggplot(admit, aes(x=CGPA, y=`GRE Score`))+
  geom_point(color="deepskyblue3" )
Figure 2

Hypothesis Testing

  1. Independence
  2. Linearity
  3. Normality of residuals
  4. Homoscedasticity
  5. Residual vs Leverage(Influential cases)

Testing the Overall Model

cgpaGREmodel <- lm(`GRE Score` ~ CGPA, data = admit)
cgpaGREmodel %>% summary()
## 
## Call:
## lm(formula = `GRE Score` ~ CGPA, data = admit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1178  -4.2131   0.4833   3.7378  26.9171 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 177.6010     4.6387   38.29   <2e-16 ***
## CGPA         16.1852     0.5379   30.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.334 on 397 degrees of freedom
## Multiple R-squared:  0.6952, Adjusted R-squared:  0.6944 
## F-statistic: 905.4 on 1 and 397 DF,  p-value: < 2.2e-16
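
As a rough illustration of how to read this summary, the sketch below plugs the reported coefficients into the fitted line and recovers R-squared from the F-statistic. The helper `predict_gre` is hypothetical, and the numbers are copied from the output above rather than recomputed from the data.

```r
# Coefficients as printed in the model summary
intercept <- 177.6010
slope     <- 16.1852

# Hypothetical helper: fitted GRE Score for a given CGPA.
# Only meaningful within the observed CGPA range (about 6.8 to 9.92).
predict_gre <- function(cgpa) intercept + slope * cgpa

gre_at_9 <- predict_gre(9.0)                      # roughly 323.3
gre_gain <- predict_gre(9.0) - predict_gre(8.0)   # equals the slope

# For simple linear regression (one predictor), R-squared follows from F:
# R^2 = F / (F + residual df)
f_stat    <- 905.4
df_resid  <- 397
r_squared <- f_stat / (f_stat + df_resid)

print(c(gre_at_9 = gre_at_9, gre_gain = gre_gain, r_squared = r_squared))
```

The recovered R-squared agrees with the reported 0.6952 to four decimal places, and the slope reads as an expected gain of about 16 GRE points per additional CGPA point.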

Linear Regression - Testing Model Parameters

cgpaGREmodel %>% summary() %>% coef()
##              Estimate Std. Error  t value      Pr(>|t|)
## (Intercept) 177.60099   4.638680 38.28697 2.408109e-135
## CGPA         16.18524   0.537905 30.08940 1.843967e-104
cgpaGREmodel %>% confint()
##                 2.5 %    97.5 %
## (Intercept) 168.48154 186.72043
## CGPA         15.12774  17.24274
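
The t value and confidence interval for the slope can be reproduced by hand from the estimate and its standard error. This is a minimal sketch using the figures printed above, with the residual degrees of freedom (397) taken from the model summary.

```r
# Slope estimate and standard error as printed by coef(summary(...))
estimate <- 16.18524
se       <- 0.537905
df_resid <- 397

# Test of H0: slope = 0 against a two-sided alternative
t_value <- estimate / se
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# 95% confidence interval: estimate +/- critical t * standard error
t_crit <- qt(0.975, df = df_resid)
ci     <- estimate + c(-1, 1) * t_crit * se

print(c(t_value = t_value, p_value = p_value))
print(c(lower = ci[1], upper = ci[2]))
```

Because the interval excludes zero, it leads to the same decision as the t test: the null hypothesis of a zero slope is rejected at the 5% level.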

Summarisation of the linear relationship

ggplot(admit, aes(x=CGPA, y=`GRE Score`))+
  geom_point(color="deepskyblue3" )+
  geom_smooth(method=lm,se=FALSE)
Figure 3

Validation of all the assumptions for linear regression

  1. Independence: Independence was assumed because each pair of CGPA and GRE Score measurements came from a different person.
  2. Linearity: The scatter plot (Figure 2) suggested a linear relationship, and there were no non-linear trends in the residuals vs. fitted plot (Figure 4).
  3. Normality of residuals: The normal Q-Q plot (Figure 5) showed no obvious departures from normality. Most of the residuals fell close to the line.
par(mfrow=c(1,2))
cgpaGREmodel %>% plot(which=1)
cgpaGREmodel %>% plot(which=2)
Figure 4 (left): Residuals vs Fitted. Figure 5 (right): Normal Q-Q.

Validation of all the assumptions for linear regression (cont..)

  4. Homoscedasticity: The scale-location plot (Figure 6) raised no concerns. The variance of the residuals appeared constant across the predicted values.
  5. Residual vs Leverage: In the residuals vs. leverage plot (Figure 7), no points fell outside the Cook's distance bands, so there appeared to be no influential cases.
par(mfrow=c(1,2))
cgpaGREmodel %>% plot(which=3)
cgpaGREmodel %>% plot(which=5)
Figure 6 (left): Scale-Location. Figure 7 (right): Residuals vs Leverage.
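
The visual checks above can be complemented with numeric ones. The sketch below is not part of the original analysis: it runs on simulated data (the `sim` data frame and its parameters are assumptions chosen to resemble the fitted model) so that it works without the CSV file. It applies a Shapiro-Wilk test to the residuals and compares Cook's distances with the 0.5 contour that R draws by default on the residuals vs. leverage plot.

```r
# Simulated stand-in for the admissions data (an assumption, not real data)
set.seed(1)
sim <- data.frame(CGPA = runif(200, min = 6.8, max = 9.9))
sim$GRE <- 177.6 + 16.2 * sim$CGPA + rnorm(200, sd = 6.3)

fit <- lm(GRE ~ CGPA, data = sim)

# Normality of residuals: Shapiro-Wilk test
# (a small p-value would suggest a departure from normality)
shapiro_p <- shapiro.test(residuals(fit))$p.value

# Influential cases: Cook's distance for every observation,
# compared with the 0.5 contour used on the leverage plot
cook <- cooks.distance(fit)
max_cook <- max(cook)
any_influential <- any(cook > 0.5)

print(c(shapiro_p = shapiro_p, max_cook = max_cook))
print(any_influential)
```

To apply the same checks to the real model, `fit` would be replaced with `cgpaGREmodel`.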

Discussion

  1. The dataset was easy to inspect and understand, and it includes the essential metadata.
  2. The variables were informative and allowed the required hypothesis tests to be carried out without hindrance.
  1. This investigation is constrained to two quantitative variables (CGPA and GRE Score).
  2. There are several other parameters which can be considered for prediction of GRE Score.

Discussion (cont..)

  1. A future investigation should expand the dataset to include other parameters, such as IQ level, strength in solving maths problems and mastery of English, to predict a student's GRE Score.
  2. In the future, Multiple Linear Regression can be carried out by considering multiple predictor variables.

The simple linear regression found a statistically significant positive linear relationship between CGPA and GRE Score (slope = 16.19, 95% CI [15.13, 17.24], p < .001), with CGPA explaining about 69.5% of the variation in GRE Score. The investigation therefore concludes that, on average, GRE Score rises as CGPA increases.

References