14 February 2020
In this exercise, I want to predict someone’s chance of admission to a university of their choice, using linear regression on the other variables in the dataset.
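I will use MSE and RMSE to measure my model’s accuracy. For those who are not familiar, MSE (mean squared error) and RMSE (root mean squared error) measure how far the predictions are from the actual values. Their size has to be judged against the range of the target: MSE is in squared units of the target, while RMSE is in the same units as the target. If we get an MSE of 1000 (an RMSE of about 32) and our data is in the millions, that’s a very low error. But if our MSE is 1000 and our data is in the hundreds, our error is very high.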
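For reference, these are the usual textbook definitions of the two measures. The symbols are my own notation, not something defined elsewhere in this post: $y_i$ is the actual chance of admission for observation $i$, $\hat{y}_i$ is the model’s prediction, and $n$ is the number of observations.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

Because of the square root, RMSE can be read directly against the target range, which here runs from 0 to 1.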
Library packages that I’m using
The data is kindly provided by Mohan S Acharya, Asfia Armaan and Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science, 2019.
Checking the data
| Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|
| 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.00 | 1 | 0.72 |
| 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.80 |
| 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |
| 6 | 330 | 115 | 5 | 4.5 | 3.0 | 9.34 | 1 | 0.90 |
| 7 | 321 | 109 | 3 | 3.0 | 4.0 | 8.20 | 1 | 0.75 |
| 8 | 308 | 101 | 2 | 3.0 | 4.0 | 7.90 | 0 | 0.68 |
| 9 | 302 | 102 | 1 | 2.0 | 1.5 | 8.00 | 0 | 0.50 |
| 10 | 323 | 108 | 3 | 3.5 | 3.0 | 8.60 | 0 | 0.45 |
The columns in the data:
1. GRE Scores ( out of 340 )
2. TOEFL Scores ( out of 120 )
3. University Rating ( out of 5 )
4. Statement of Purpose and Letter of Recommendation Strength ( out of 5 )
5. Undergraduate GPA ( out of 10 )
6. Research Experience ( either 0 or 1 )
7. Chance of Admit ( ranging from 0 to 1 )
I’m changing the column names to something more code-friendly and converting our research column from numeric to a logical TRUE/FALSE.
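The renaming step could look roughly like the sketch below. It rebuilds the first five rows of the table by hand so that it runs on its own. Most of the new names are taken from the model formula used later in this post, but `serial_no` and `sop` are my assumption, since those two columns never appear in the formula.

```r
# First five rows of the table above, keeping the original column names
admission <- data.frame(
  `Serial No.` = 1:5,
  `GRE Score` = c(337, 324, 316, 322, 314),
  `TOEFL Score` = c(118, 107, 104, 110, 103),
  `University Rating` = c(4, 4, 3, 3, 2),
  SOP = c(4.5, 4.0, 3.0, 3.5, 2.0),
  LOR = c(4.5, 4.5, 3.5, 2.5, 3.0),
  CGPA = c(9.65, 8.87, 8.00, 8.67, 8.21),
  Research = c(1, 1, 1, 1, 0),
  `Chance of Admit` = c(0.92, 0.76, 0.72, 0.80, 0.65),
  check.names = FALSE
)

# Code-friendly names ("serial_no" and "sop" are assumed names)
names(admission) <- c("serial_no", "gre_score", "toefl_score",
                      "university_rating", "sop", "lor", "cgpa",
                      "research", "chance_of_admit")

# Convert the 0/1 research flag to FALSE/TRUE
admission$research <- as.logical(admission$research)

str(admission)
```

With `research` stored as a logical, `lm()` reports its coefficient as `researchTRUE`, which matches the label in the model summary.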
To test the model at a later stage, I’ll split the dataset into training and testing data with a ratio of 8:2.
set.seed(100)
intrain <- sample(nrow(admission), nrow(admission) *.8)
admission_train <- admission[intrain,]
admission_test <- admission[-intrain,]
Checking if any of the variables are linearly related to each other.
Seeing that all variables correlate well with each other except the serial number, I think we can move on to modelling and exclude the serial number from our model.
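A minimal way to run this kind of check is a plain correlation matrix with base R’s `cor()`. The sketch below uses only the ten rows printed in the table earlier and a subset of the columns, so its numbers will not match a check on the full dataset. The name `admission_sample` is made up for this example.

```r
# The ten rows shown in the table above (subset of columns)
admission_sample <- data.frame(
  serial_no = 1:10,
  gre_score = c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323),
  toefl_score = c(118, 107, 104, 110, 103, 115, 109, 101, 102, 108),
  cgpa = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00, 8.60),
  chance_of_admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)
)

# Pairwise Pearson correlations between all numeric columns
cor_matrix <- cor(admission_sample)
round(cor_matrix, 2)

# Correlation of every column with the target only
round(cor_matrix[, "chance_of_admit"], 2)
```

Values close to 1 or -1 point to a strong linear relationship, values near 0 to a weak one.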
I’ll make a linear model with the name model_admission
model_admission <- lm(formula = chance_of_admit ~ gre_score + toefl_score +
university_rating + lor + cgpa + research, data = admission_train)
summary(model_admission)
##
## Call:
## lm(formula = chance_of_admit ~ gre_score + toefl_score + university_rating +
## lor + cgpa + research, data = admission_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.243033 -0.024065 0.007685 0.032871 0.153770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.4051368 0.1074149 -13.081 < 2e-16 ***
## gre_score 0.0023133 0.0005256 4.401 1.39e-05 ***
## toefl_score 0.0030671 0.0009075 3.380 0.000798 ***
## university_rating 0.0066859 0.0036723 1.821 0.069424 .
## lor 0.0114540 0.0041109 2.786 0.005590 **
## cgpa 0.1161438 0.0100539 11.552 < 2e-16 ***
## researchTRUE 0.0200083 0.0069335 2.886 0.004120 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05647 on 393 degrees of freedom
## Multiple R-squared: 0.8336, Adjusted R-squared: 0.831
## F-statistic: 328 on 6 and 393 DF, p-value: < 2.2e-16
I’m pretty happy with the resulting R-squared and t values. The university rating has a lower t value (p ≈ 0.069, so it is not significant at the 0.05 level), but I personally think that it’s an important variable, so I’ll keep it in the linear model.
##
## Shapiro-Wilk normality test
##
## data: model_admission$residuals
## W = 0.93567, p-value = 3.987e-12
##
## studentized Breusch-Pagan test
##
## data: model_admission
## BP = 22.58, df = 6, p-value = 0.0009502
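The Shapiro-Wilk test rejects normality of the residuals (p = 3.987e-12), and the Breusch-Pagan test (p = 0.00095) suggests the residual variance is not constant. So there seems to be a bit of a pattern here, but looking at the graph, I think it’s still acceptable.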
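The two outputs above look like the result of `shapiro.test()` and a studentized Breusch-Pagan test such as `bptest()` from the `lmtest` package, though the exact calls are my assumption. To avoid depending on an extra package, the sketch below computes the studentized (Koenker) statistic by hand as n times the R-squared of a regression of the squared residuals on the predictors. It uses the ten rows from the table and a smaller model, so `admission_sample`, `sample_model` and `bp_studentized` are illustrative names and the numbers will differ from the ones above.

```r
# Ten rows from the table above, with a reduced set of predictors
admission_sample <- data.frame(
  gre_score = c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323),
  cgpa = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00, 8.60),
  chance_of_admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)
)
sample_model <- lm(chance_of_admit ~ gre_score + cgpa, data = admission_sample)

# Normality of residuals: a small p-value means normality is rejected
shapiro.test(residuals(sample_model))

# Studentized Breusch-Pagan statistic, computed by hand.
# Assumes the model has an intercept.
bp_studentized <- function(model) {
  X <- model.matrix(model)              # intercept plus predictors
  e2 <- residuals(model)^2              # squared residuals
  aux <- lm.fit(X, e2)                  # auxiliary regression of e2 on X
  r2 <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2               # n * R-squared
  df <- ncol(X) - 1                     # predictors, excluding the intercept
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

bp_result <- bp_studentized(sample_model)
bp_result
```

A small p-value in the second test means the spread of the residuals changes with the predictors (heteroscedasticity).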
## gre_score toefl_score university_rating lor
## 4.400360 3.698528 2.133388 1.727872
## cgpa research
## 4.483945 1.484216
All the variance inflation factors above are below 5, so it seems our predictor variables do not correlate with each other strongly enough to cause a multicollinearity problem.
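The figures above look like variance inflation factors, as printed by a function such as `vif()` from the `car` package. That is my reading of the output, not something the post states. For one predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes this by hand on the ten rows from the table with a smaller model. `vif_manual` is a made-up helper name, and the results will not match the numbers above.

```r
# Ten rows from the table above, with a reduced set of predictors
admission_sample <- data.frame(
  gre_score = c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323),
  toefl_score = c(118, 107, 104, 110, 103, 115, 109, 101, 102, 108),
  cgpa = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00, 8.60),
  chance_of_admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)
)
sample_model <- lm(chance_of_admit ~ gre_score + toefl_score + cgpa,
                   data = admission_sample)

# VIF for each predictor column: 1 / (1 - R^2 of predictor ~ other predictors).
# Assumes an intercept and at least two predictor columns.
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]   # drop the intercept column
  sapply(colnames(X), function(name) {
    others <- X[, colnames(X) != name, drop = FALSE]
    aux <- lm.fit(cbind(1, others), X[, name])
    r2 <- 1 - sum(aux$residuals^2) / sum((X[, name] - mean(X[, name]))^2)
    1 / (1 - r2)
  })
}

vif_values <- vif_manual(sample_model)
vif_values
```

A VIF of 1 means a predictor is uncorrelated with the others. Common rules of thumb treat values above 5 or 10 as a warning sign.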
The MSE and RMSE of the model on our training data:
## [1] 0.003132578
## [1] 0.05596944
Next, I will test the model I’ve built on the admission_test dataset that we split off earlier. The MSE and RMSE on the test data:
## [1] 0.005341
## [1] 0.07308215
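The code that produced these numbers is not shown, so here is a minimal sketch of how MSE and RMSE can be computed in base R. The helpers `mse()` and `rmse()` are my own, and the `predicted` values are made up purely to show the arithmetic. Only the `actual` values come from the table earlier in the post.

```r
# Mean squared error and its square root
mse <- function(actual, predicted) mean((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))

# Actual values: first five rows of the table above
actual <- c(0.92, 0.76, 0.72, 0.80, 0.65)
# Hypothetical predictions, invented for illustration only
predicted <- c(0.95, 0.80, 0.65, 0.78, 0.63)

toy_mse <- mse(actual, predicted)
toy_rmse <- rmse(actual, predicted)
toy_mse    # 0.00164
toy_rmse   # about 0.0405
```

With the objects from this post, the same helpers would be called as `mse(admission_test$chance_of_admit, predict(model_admission, newdata = admission_test))`.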
The data on its own has shown, even at the early linearity check, a very strong linear relationship between its target variable and its predictor variables. Very little feature engineering needs to be done before getting a good linear regression model.
My linear regression model received an adjusted R-squared of 0.831 on the training data, as shown in the model summary. This means that the predictor variables chosen for the model explain about 83% of the variance in the target variable, in this case the chance of admission.
The model’s MSE and RMSE when used against our test data are 0.005341 and 0.07308215 respectively.
Considering that our data ranges between 0.34 and 0.97, I think my model has achieved a pretty good prediction result, with an average error of 0.042.
The predictor variables that most significantly affect our linear model, with p-values between 0 and 0.001, are GRE Score, TOEFL Score and Undergraduate GPA. The predictor variables with p-values between 0.001 and 0.01 are Letter of Recommendation and Research Experience.
The variable that most strongly impacts our chance of admission is CGPA, with a coefficient of 0.116. This means that a one-unit increase in CGPA, with the other predictors held constant, raises our predicted chance of admission by 0.116.
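To make the coefficient interpretation concrete, the sketch below applies the estimates from the model summary by hand to the first applicant in the table (GRE 337, TOEFL 118, university rating 4, LOR 4.5, CGPA 9.65, research TRUE). The vector names are mine. In practice, `predict(model_admission, newdata = ...)` does the same job.

```r
# Coefficient estimates copied from the model summary above
coefs <- c(
  intercept = -1.4051368,
  gre_score = 0.0023133,
  toefl_score = 0.0030671,
  university_rating = 0.0066859,
  lor = 0.0114540,
  cgpa = 0.1161438,
  research = 0.0200083
)

# First row of the table; intercept is multiplied by 1, research TRUE counts as 1
applicant <- c(
  intercept = 1, gre_score = 337, toefl_score = 118,
  university_rating = 4, lor = 4.5, cgpa = 9.65, research = 1
)

# Linear prediction: sum of coefficient * value
predicted <- sum(coefs * applicant[names(coefs)])
predicted   # about 0.955, against a listed chance of admit of 0.92

# Raise CGPA by one unit and keep everything else fixed
higher_cgpa <- applicant
higher_cgpa["cgpa"] <- higher_cgpa["cgpa"] + 1
cgpa_effect <- sum(coefs * higher_cgpa[names(coefs)]) - predicted
cgpa_effect # equals the CGPA coefficient
```

CGPA is on a 10-point scale while GRE is out of 340, so a one-unit step means something quite different for each variable when comparing raw coefficients.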