Linear Regression on Graduate Admissions Prediction

1. Explanation

The beginning

Hello everyone, welcome to my fourth RPubs. I made this RMD to fulfill my LBB assignment. Hope you like it :)

About the Data

The dataset contains several parameters that are considered important when applying for Master's programs.

The parameters included are:
  • GRE Scores ( out of 340 )
  • TOEFL Scores ( out of 120 )
  • University Rating ( out of 5 )
  • Statement of Purpose and Letter of Recommendation Strength ( out of 5 )
  • Undergraduate GPA ( out of 10 )
  • Research Experience ( either 0 or 1 )
  • Chance of Admit ( ranging from 0 to 1 )

Business Goal

This linear regression model is built to help students shortlist universities that match their profiles. The predicted output gives them a fair idea of their chances at a particular university, so we will try to predict the chance of being admitted to a master's program based on each applicant's undergraduate profile.

2. Data Wrangling / Exploratory Data Analysis
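
The chunks below call functions from several packages that are not loaded explicitly in this write-up. Based on the functions used (ggcorr, melt, glue, MSE/RMSE, bptest, check_model), a plausible set of library() calls is the following sketch; the exact list is an assumption.

# packages the later chunks appear to rely on (inferred from the functions called)
library(dplyr)        # %>%, mutate_at, select, mutate
library(ggplot2)      # ggplot, geom_boxplot, geom_point, ...
library(GGally)       # ggcorr
library(reshape2)     # melt
library(glue)         # glue
library(MLmetrics)    # MSE, RMSE
library(lmtest)       # bptest
library(performance)  # check_model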

Load dataset

admission <- read.csv("dataset/Admission_Predict.csv")
Done, let’s move to the next step

2.1 Dataset Inspection

Get first 5 rows

head(admission, 5)
##   Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1          1       337         118                 4 4.5 4.5 9.65        1
## 2          2       324         107                 4 4.0 4.5 8.87        1
## 3          3       316         104                 3 3.0 3.5 8.00        1
## 4          4       322         110                 3 3.5 2.5 8.67        1
## 5          5       314         103                 2 2.0 3.0 8.21        0
##   Chance.of.Admit
## 1            0.92
## 2            0.76
## 3            0.72
## 4            0.80
## 5            0.65

Get last 5 rows

tail(admission, 5)
##     Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 396        396       324         110                 3 3.5 3.5 9.04        1
## 397        397       325         107                 3 3.0 3.5 9.11        1
## 398        398       330         116                 4 5.0 4.5 9.45        1
## 399        399       312         103                 3 3.5 4.0 8.78        0
## 400        400       333         117                 4 5.0 4.0 9.66        1
##     Chance.of.Admit
## 396            0.82
## 397            0.84
## 398            0.91
## 399            0.67
## 400            0.95

Get total rows / observation

nrow(admission)
## [1] 400

Get total columns

ncol(admission)
## [1] 9

Get all columns names

names(admission)
## [1] "Serial.No."        "GRE.Score"         "TOEFL.Score"      
## [4] "University.Rating" "SOP"               "LOR"              
## [7] "CGPA"              "Research"          "Chance.of.Admit"

Get dimension of dataset

dim(admission)
## [1] 400   9
From our inspection we can note a few things:
  • The admission dataset contains 400 rows and 9 columns
  • The column names are: Serial.No., GRE.Score, TOEFL.Score, University.Rating, SOP, LOR, CGPA, Research, Chance.of.Admit

2.2 Explicit Data Coercion

Check the data type of every variable

str(admission)
## 'data.frame':    400 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

Convert the University.Rating and Research columns to factors because they hold a limited set of repeated (categorical) values

admision_new <- admission %>%
  mutate_at(c("University.Rating", "Research"), as.factor)

str(admision_new)
## 'data.frame':    400 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: Factor w/ 5 levels "1","2","3","4",..: 4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

2.3 Correlation Between Variables with a Correlation Matrix

ggcorr(admision_new,
       label = T,
       label_size = 2.5,
       cex = 2.6)

Remove the Serial.No. column because it is only a row index and has no meaningful correlation with the other variables

admision_new <- admision_new %>%
  select(-Serial.No.)

2.4 Check for Outliers

# melt the admission data frame so all variables can be plotted in one faceted boxplot
df.m <- melt(admission, id.vars = "Serial.No.")

ggplot(data = df.m, aes(x = variable, y = value)) +
  geom_boxplot(aes(fill = variable), show.legend = F) +
  labs(x = NULL, y = "Value") +
  facet_wrap(~ variable, scales = "free") +
  ggtitle("Outlier Detection in All Variable") +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.title.y = element_text(face = "bold"),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )

It looks like we have low outliers in our target variable (Chance.of.Admit) and in the predictor variables LOR and CGPA; the sketch below counts how many points fall below the lower whisker.
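
To make the visual impression concrete, here is a small sketch (not in the original notebook) that counts low outliers in the flagged columns using the standard 1.5 * IQR boxplot rule:

# count how many points fall below the lower whisker (1.5 * IQR rule)
sapply(admision_new[c("LOR", "CGPA", "Chance.of.Admit")], function(x) {
  lower_whisker <- boxplot.stats(x)$stats[1]
  sum(x < lower_whisker)
})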

2.5 Check for Missing Values

anyNA(admision_new)
## [1] FALSE

Yeayy, there are no missing values :D

3. Modeling

3.1 Create Linear Regression Model

model_admission <- lm(formula = Chance.of.Admit~., data = admision_new)

3.2 Check the Summary of Our Model

summary(model_admission)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admision_new)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.260486 -0.022911  0.009145  0.037471  0.162276 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.2438075  0.1271202  -9.784  < 2e-16 ***
## GRE.Score           0.0017118  0.0005986   2.860  0.00447 ** 
## TOEFL.Score         0.0030615  0.0010927   2.802  0.00533 ** 
## University.Rating2 -0.0147251  0.0147760  -0.997  0.31960    
## University.Rating3 -0.0093367  0.0161271  -0.579  0.56296    
## University.Rating4 -0.0073707  0.0196734  -0.375  0.70812    
## University.Rating5  0.0103680  0.0216662   0.479  0.63254    
## SOP                -0.0026190  0.0055744  -0.470  0.63874    
## LOR                 0.0227892  0.0055456   4.109 4.84e-05 ***
## CGPA                0.1187078  0.0122185   9.715  < 2e-16 ***
## Research1           0.0243631  0.0079708   3.057  0.00239 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06375 on 389 degrees of freedom
## Multiple R-squared:  0.8052, Adjusted R-squared:  0.8002 
## F-statistic: 160.8 on 10 and 389 DF,  p-value: < 2.2e-16
From our model summary we can conclude:
  • The predictor variables University.Rating and SOP do not have a significant impact on the target variable (Chance.of.Admit)
  • Because we have multiple predictors, we focus on the Adjusted R-squared. Its value tells us that the predictors explain about 80% of the variance in the target variable (Chance.of.Admit), as the small check below shows
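
For reference, the Adjusted R-squared reported above can be reproduced by hand from the plain R-squared; a small sketch:

# adjusted R-squared by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- summary(model_admission)$r.squared   # 0.8052
n  <- nrow(admision_new)                   # 400 observations
p  <- length(coef(model_admission)) - 1    # 10 estimated slopes (factor levels dummy-coded)
1 - (1 - r2) * (n - 1) / (n - p - 1)       # ~0.8002, matching the summary output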

4. Model Improvement

Let's improve our model by tuning it with feature selection, using backward elimination. Starting from the full set of predictors, the model is re-evaluated while predictors are dropped one at a time until the model with the smallest AIC is reached. A smaller AIC means a better trade-off between goodness of fit and model complexity.
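
A small aside on the AIC values printed by step(): they come from extractAIC(), which is offset from AIC() by an additive constant, so the absolute numbers look unusual but model comparisons are unaffected. A quick sketch:

# step() ranks candidate models by extractAIC(), which differs from AIC()
# by an additive constant; the ordering of the models is the same either way
extractAIC(model_admission)  # second element matches the Start: AIC value reported by step() below
AIC(model_admission)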

model_backward <- step(object = model_admission, direction = "backward")
## Start:  AIC=-2191.44
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - University.Rating  4   0.01992 1.6006 -2194.4
## - SOP                1   0.00090 1.5816 -2193.2
## <none>                           1.5807 -2191.4
## - TOEFL.Score        1   0.03190 1.6126 -2185.4
## - GRE.Score          1   0.03323 1.6139 -2185.1
## - Research           1   0.03796 1.6186 -2183.9
## - LOR                1   0.06862 1.6493 -2176.4
## - CGPA               1   0.38354 1.9642 -2106.6
## 
## Step:  AIC=-2194.43
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + SOP + LOR + CGPA + 
##     Research
## 
##               Df Sum of Sq    RSS     AIC
## - SOP          1   0.00024 1.6008 -2196.4
## <none>                     1.6006 -2194.4
## - TOEFL.Score  1   0.03291 1.6335 -2188.3
## - GRE.Score    1   0.03585 1.6364 -2187.6
## - Research     1   0.03935 1.6400 -2186.7
## - LOR          1   0.07445 1.6750 -2178.2
## - CGPA         1   0.41691 2.0175 -2103.8
## 
## Step:  AIC=-2196.38
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
## 
##               Df Sum of Sq    RSS     AIC
## <none>                     1.6008 -2196.4
## - TOEFL.Score  1   0.03292 1.6338 -2190.2
## - GRE.Score    1   0.03638 1.6372 -2189.4
## - Research     1   0.03912 1.6400 -2188.7
## - LOR          1   0.09133 1.6922 -2176.2
## - CGPA         1   0.43201 2.0328 -2102.8
summary(model_backward)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = admision_new)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.263542 -0.023297  0.009879  0.038078  0.159897 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.2984636  0.1172905 -11.070  < 2e-16 ***
## GRE.Score    0.0017820  0.0005955   2.992  0.00294 ** 
## TOEFL.Score  0.0030320  0.0010651   2.847  0.00465 ** 
## LOR          0.0227762  0.0048039   4.741 2.97e-06 ***
## CGPA         0.1210042  0.0117349  10.312  < 2e-16 ***
## Research1    0.0245769  0.0079203   3.103  0.00205 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06374 on 394 degrees of freedom
## Multiple R-squared:  0.8027, Adjusted R-squared:  0.8002 
## F-statistic: 320.6 on 5 and 394 DF,  p-value: < 2.2e-16

It looks like our Adjusted R-squared did not increase (it is still 0.8002), but our set of predictors is now much cleaner: University.Rating and SOP have been removed, and every remaining predictor is significant with respect to the target variable.

Let's try to remove the outlier from Chance.of.Admit (the target variable) to improve the model.

# check for the outlier in the target variable with a boxplot
boxplot(admision_new$Chance.of.Admit)

Looks like we have a low outlier, let's remove it

# check for minimum value 
min(admision_new$Chance.of.Admit)
## [1] 0.34
# remove the outlier
admision_no_outlier <- admision_new[admision_new$Chance.of.Admit != 0.34,]

boxplot(admision_no_outlier$Chance.of.Admit)
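
As a side note, a more general alternative (a sketch only, not what the results below use) would be to drop every observation below the lower whisker instead of only the rows equal to the minimum value; admision_no_outlier_alt is a hypothetical name for that variant:

# (sketch, not used for the results below)
# drop every point below the lower whisker, not just the minimum value
low_cut <- boxplot.stats(admision_new$Chance.of.Admit)$stats[1]
admision_no_outlier_alt <- admision_new[admision_new$Chance.of.Admit >= low_cut, ]
nrow(admision_no_outlier_alt)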

Create Linear Regression Model Again

model_no_outlier <- lm(Chance.of.Admit~. , admision_no_outlier)
model_back_no_outlier <- step(object = model_no_outlier, direction = "backward")
## Start:  AIC=-2195.95
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research
## 
##                     Df Sum of Sq    RSS     AIC
## - University.Rating  4   0.01791 1.5304 -2199.3
## - SOP                1   0.00002 1.5125 -2197.9
## <none>                           1.5124 -2195.9
## - TOEFL.Score        1   0.02586 1.5383 -2191.2
## - GRE.Score          1   0.03094 1.5434 -2189.9
## - Research           1   0.03600 1.5484 -2188.6
## - LOR                1   0.06010 1.5725 -2182.4
## - CGPA               1   0.38239 1.8948 -2108.2
## 
## Step:  AIC=-2199.26
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + SOP + LOR + CGPA + 
##     Research
## 
##               Df Sum of Sq    RSS     AIC
## - SOP          1   0.00035 1.5307 -2201.2
## <none>                     1.5304 -2199.3
## - TOEFL.Score  1   0.02688 1.5572 -2194.3
## - GRE.Score    1   0.03291 1.5633 -2192.8
## - Research     1   0.03677 1.5671 -2191.8
## - LOR          1   0.06556 1.5959 -2184.6
## - CGPA         1   0.41563 1.9460 -2105.6
## 
## Step:  AIC=-2201.17
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
## 
##               Df Sum of Sq    RSS     AIC
## <none>                     1.5307 -2201.2
## - TOEFL.Score  1   0.02888 1.5596 -2195.7
## - GRE.Score    1   0.03262 1.5633 -2194.8
## - Research     1   0.03776 1.5685 -2193.5
## - LOR          1   0.09136 1.6221 -2180.1
## - CGPA         1   0.44003 1.9707 -2102.6
summary(model_back_no_outlier)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, data = admision_no_outlier)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26400 -0.02298  0.00970  0.03647  0.15645 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.2583682  0.1153734 -10.907  < 2e-16 ***
## GRE.Score    0.0016892  0.0005844   2.890  0.00406 ** 
## TOEFL.Score  0.0028424  0.0010451   2.720  0.00683 ** 
## LOR          0.0227851  0.0047107   4.837  1.9e-06 ***
## CGPA         0.1222594  0.0115172  10.615  < 2e-16 ***
## Research1    0.0241487  0.0077654   3.110  0.00201 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06249 on 392 degrees of freedom
## Multiple R-squared:  0.8042, Adjusted R-squared:  0.8017 
## F-statistic:   322 on 5 and 392 DF,  p-value: < 2.2e-16

The increase in Adjusted R-squared is not dramatic (from 0.8002 to 0.8017), but it did improve a bit :)
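
A compact way to keep the models fitted so far side by side (a convenience sketch, not in the original write-up):

# side-by-side adjusted R-squared for the three fits so far
data.frame(
  model  = c("full", "backward", "backward, no outlier"),
  adj_r2 = c(summary(model_admission)$adj.r.squared,
             summary(model_backward)$adj.r.squared,
             summary(model_back_no_outlier)$adj.r.squared)
)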

Let's try to predict Chance of Admit using the first 10 rows of our training data as example inputs

data_test = admision_new[,(!colnames(admision_new) %in% c("Chance.of.Admit"))][1:10,]

data_test
##    GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1        337         118                 4 4.5 4.5 9.65        1
## 2        324         107                 4 4.0 4.5 8.87        1
## 3        316         104                 3 3.0 3.5 8.00        1
## 4        322         110                 3 3.5 2.5 8.67        1
## 5        314         103                 2 2.0 3.0 8.21        0
## 6        330         115                 5 4.5 3.0 9.34        1
## 7        321         109                 3 3.0 4.0 8.20        1
## 8        308         101                 2 3.0 4.0 7.90        0
## 9        302         102                 1 2.0 1.5 8.00        0
## 10       323         108                 3 3.5 3.0 8.60        0
predict_admission <- predict(object = model_back_no_outlier, newdata = data_test)

predict_admission
##         1         2         3         4         5         6         7         8 
## 0.9527673 0.8041795 0.6529883 0.7393064 0.6369008 0.8603380 0.7114905 0.6059657 
##         9        10 
## 0.5539364 0.7139964
result <- data.frame(Prediction = predict_admission) %>% 
  mutate(Chance.Of.Admit = glue("{round(predict_admission * 100,0)}%"))

result
##    Prediction Chance.Of.Admit
## 1   0.9527673             95%
## 2   0.8041795             80%
## 3   0.6529883             65%
## 4   0.7393064             74%
## 5   0.6369008             64%
## 6   0.8603380             86%
## 7   0.7114905             71%
## 8   0.6059657             61%
## 9   0.5539364             55%
## 10  0.7139964             71%
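
Since these 10 rows come from the training data, we can also put the predictions next to the actual values for a quick eyeball check (a small sketch):

# compare the predictions with the actual Chance.of.Admit for the same 10 rows
data.frame(actual    = admision_new$Chance.of.Admit[1:10],
           predicted = round(predict_admission, 3))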

Let's plot the original Chance.of.Admit values against the model's fitted values

data_prediction <- data.frame(chance.of.admit = admision_no_outlier$Chance.of.Admit, predict = model_back_no_outlier$fitted.values)

ggplot(data_prediction , aes(x = chance.of.admit, y = predict)) + geom_point()  + 
    geom_smooth() + theme(panel.grid = element_blank(), panel.background = element_blank())

5. Model Evaluation

5.1 Error Values

Check the model's error with MSE and RMSE

MSE(y_pred = model_no_outlier$fitted.values, y_true = admision_no_outlier$Chance.of.Admit)
## [1] 0.003800104
RMSE(y_pred = model_no_outlier$fitted.values, y_true = admision_no_outlier$Chance.of.Admit)
## [1] 0.06164498
# Check the range value of Chance.of.Admit
range(admision_no_outlier$Chance.of.Admit)
## [1] 0.36 0.97
From the error values we can conclude:
  • The MSE is quite small relative to the range of Chance.of.Admit (0.36 to 0.97)
  • The RMSE of about 0.06 means predictions are typically off by roughly ±0.06 on the Chance.of.Admit scale, which is also small; the quick calculation below expresses it relative to the target's range. Our model seems quite good :D
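
As a quick sanity check (a sketch, not in the original write-up), the RMSE can be expressed as a share of the target's range:

# express the RMSE as a share of the target's range to make "small" concrete
rmse <- RMSE(y_pred = model_no_outlier$fitted.values,
             y_true = admision_no_outlier$Chance.of.Admit)
rmse / diff(range(admision_no_outlier$Chance.of.Admit))  # roughly 0.10, i.e. about 10% of the range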

5.2 Assumption Checks

Linearity

Check for linearity, i.e. whether each predictor is (linearly) correlated with the target variable

for (i in 2:(length(admission) - 1)) {
  a <- cor.test(admission$Chance.of.Admit, admission[,i])
  print(paste(colnames(admission)[i], " est:", a$estimate, " p=value:", a$p.value))
}
## [1] "GRE.Score  est: 0.80261045959035  p=value: 2.45811241417901e-91"
## [1] "TOEFL.Score  est: 0.791593986935104  p=value: 3.63410217599769e-87"
## [1] "University.Rating  est: 0.711250250391722  p=value: 6.63501948088894e-63"
## [1] "SOP  est: 0.675731858388672  p=value: 1.14109466710233e-54"
## [1] "LOR  est: 0.669888792010694  p=value: 2.00731451975237e-53"
## [1] "CGPA  est: 0.8732890993553  p=value: 2.33651400049777e-126"
## [1] "Research  est: 0.55320213701904  p=value: 1.91817338069221e-33"

Good. All of our predictor variables are significantly correlated with the target variable, because every p-value < alpha (0.05)

Normality

Check the normality of the residuals, i.e. whether the errors are distributed roughly symmetrically around 0

data_residuals <- data.frame(residual = model_no_outlier$residuals, fitted = model_no_outlier$fitted.values)

ggplot(data_residuals, aes(x = residual)) +
  geom_histogram() +
  labs(x = "Residuals", y = "Frequency", title = "Histogram of Residuals in Our Linear Regression Model") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

shapiro.test(model_backward$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_backward$residuals
## W = 0.92193, p-value = 1.443e-13

From the histogram the residuals look centered around 0, but the Shapiro-Wilk test gives p-value < alpha, so strictly speaking the residuals are not normally distributed; a few large negative residuals give the distribution a left skew.
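
Another quick visual check (not in the original notebook) is a QQ plot of the residuals:

# QQ plot of the residuals: points that leave the reference line show where the
# distribution departs from normality (likely the heavier left tail, given the summary above)
qqnorm(model_no_outlier$residuals)
qqline(model_no_outlier$residuals, col = "red")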

Homoscedasticity

Check for homoscedasticity, i.e. whether the residuals show a pattern (such as a funnel/trumpet shape) when plotted against the fitted values. If they do, the standard errors of the coefficient estimates become biased (too narrow or too wide).

ggplot(data_residuals, aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())

bptest(model_backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 22.428, df = 5, p-value = 0.0004341

Oh no :( our model does not meet the homoscedasticity assumption (p-value < alpha). Let's try transforming the numeric predictor variables with a z-score transformation.
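
For clarity, the z-score transformation applied by scale() is just centering each column and dividing by its standard deviation; a tiny sketch to confirm:

# scale() applies the z-score transformation column-wise: z = (x - mean(x)) / sd(x)
z_gre <- (admision_no_outlier$GRE.Score - mean(admision_no_outlier$GRE.Score)) /
  sd(admision_no_outlier$GRE.Score)
all.equal(as.numeric(scale(admision_no_outlier$GRE.Score)), z_gre)  # TRUE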

# transform the data
# scale only the numeric predictors; the factor columns and the target are kept aside and re-attached below
admission_transform <- admision_no_outlier %>% 
  select(-University.Rating, -Research, -Chance.of.Admit) %>%
  scale()

# note: this reuses the name admision_new for just the factor columns plus the target
admision_new <- admision_no_outlier %>%
  select(University.Rating, Research, Chance.of.Admit)

admission_transform <- cbind(admision_new, admission_transform)

head(admission_transform)
##   University.Rating Research Chance.of.Admit   GRE.Score TOEFL.Score
## 1                 4        1            0.92  1.75965600  1.74490707
## 2                 4        1            0.76  0.62131390 -0.07655291
## 3                 3        1            0.72 -0.07920432 -0.57331472
## 4                 3        1            0.80  0.44618434  0.42020890
## 5                 2        0            0.65 -0.25433387 -0.73890199
## 6                 5        1            0.90  1.14670256  1.24814526
##           SOP         LOR       CGPA
## 1  1.09058645  1.16181985  1.7614536
## 2  0.59452541  1.16181985  0.4488305
## 3 -0.39759666  0.04759262 -1.0152491
## 4  0.09846438 -1.06663461  0.1122605
## 5 -1.38971873 -0.50952099 -0.6618506
## 6  1.09058645 -0.50952099  1.2397700

Let’s train our model once again

model_transform <- lm(Chance.of.Admit~., admission_transform)
summary(model_transform)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission_transform)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.261044 -0.023159  0.009453  0.037670  0.158871 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.7183247  0.0157504  45.607  < 2e-16 ***
## University.Rating2 -0.0109780  0.0145322  -0.755 0.450452    
## University.Rating3 -0.0087090  0.0158316  -0.550 0.582567    
## University.Rating4 -0.0059103  0.0193252  -0.306 0.759895    
## University.Rating5  0.0119745  0.0212891   0.562 0.574120    
## Research1           0.0237290  0.0078186   3.035 0.002568 ** 
## GRE.Score           0.0188736  0.0067079   2.814 0.005148 ** 
## TOEFL.Score         0.0166831  0.0064859   2.572 0.010478 *  
## SOP                 0.0003548  0.0055578   0.064 0.949131    
## LOR                 0.0191847  0.0048923   3.921 0.000104 ***
## CGPA                0.0705145  0.0071287   9.892  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06251 on 387 degrees of freedom
## Multiple R-squared:  0.8065, Adjusted R-squared:  0.8015 
## F-statistic: 161.3 on 10 and 387 DF,  p-value: < 2.2e-16
data_residuals_2 <- data.frame(residual = model_transform$residuals, fitted = model_transform$fitted.values)

ggplot(data_residuals_2, aes(fitted, residual)) + geom_point() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())

bptest(model_transform)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_transform
## BP = 37.262, df = 10, p-value = 5.099e-05

Hmm, the Breusch-Pagan test still gives p-value < alpha (5.099e-05), so formally the model still violates the homoscedasticity assumption even after the z-score transformation. The residuals-vs-fitted plot does not show an obvious funnel pattern, but the standard errors should be treated with some caution.
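
When heteroscedasticity persists, a common fallback (not part of the original write-up) is to report heteroscedasticity-consistent "sandwich" standard errors. A minimal sketch, assuming the sandwich package is available alongside lmtest:

library(lmtest)     # coeftest
library(sandwich)   # vcovHC

# re-test the coefficients of model_transform with HC3 robust standard errors
coeftest(model_transform, vcov. = vcovHC(model_transform, type = "HC3"))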

check_model(model_transform)

From the check_model() diagnostics above we can conclude:
  • Our model meets the linearity assumption
  • The residuals-vs-fitted plot shows no strong pattern, although the Breusch-Pagan test above still formally rejects homoscedasticity
  • Our model has no problematic multicollinearity, because the VIF values of all predictor variables are < 10 (see the quick check below)
  • The residuals are centered close to 0, although the Shapiro-Wilk test earlier suggests they are not perfectly normally distributed
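
The VIF claim can also be checked directly; a small sketch, assuming the car package is available (for factor predictors car::vif() reports generalized VIF, GVIF, instead of a plain VIF):

library(car)   # vif

# variance inflation factors for model_transform
# (factor terms get GVIF and GVIF^(1/(2*Df)) columns)
vif(model_transform)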

6. Conclusion

Our linear regression model (model_transform) has an Adjusted R-squared of 0.8015 and an RMSE of about 0.06, which means the predictors explain about 80% of the variance in the target variable (Chance of Admit) and the typical prediction error is about ±0.06 on the Chance of Admit scale. Because the RMSE is quite small and the Adjusted R-squared is quite large, we can say the model is pretty good.

From the model we can conclude that the Research, GRE Score, TOEFL Score, LOR, and CGPA variables have a positive and significant relationship with the Chance of Admit variable: the higher the values of these five variables, the greater the chance of being admitted to a master's program at a university. University Rating and SOP, on the other hand, do not have a significant impact on Chance of Admit even though they are positively correlated with it.