Regression models describe relationships between variables by fitting a line to the observed data. Regression lets you estimate how a dependent variable changes as the independent variable(s) change.
Multiple linear regression is the most common form of linear regression analysis. As a predictive analysis, it is used to explain the relationship between one continuous dependent variable and two or more independent variables, which can be continuous or categorical.
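The multiple linear regression equation is as follows:
\[y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n + \epsilon\]
where \(y\) is the dependent variable, \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_n\) are the regression coefficients of the independent variables \(X_1, \dots, X_n\), and \(\epsilon\) is the error term.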
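To make the equation concrete, here is a minimal sketch on synthetic data (the variables `x1`, `x2`, and `y` are made up for illustration and are not the admissions data). It fits the model with `lm()` and compares the result with the least-squares solution of the normal equations \((X^\top X)\hat\beta = X^\top y\).

```r
# Synthetic data (hypothetical; not the admissions data)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.1)

# Fit y = b0 + b1 * x1 + b2 * x2 + error by ordinary least squares
fit <- lm(y ~ x1 + x2)

# Same estimates from the normal equations: solve (X'X) b = X'y
X <- cbind(1, x1, x2)
beta_hat <- solve(crossprod(X), crossprod(X, y))

print(coef(fit))
print(drop(beta_hat))
```

Both print-outs should agree up to numerical precision. `lm()` itself uses a QR decomposition rather than the normal equations, so the second computation is only a cross-check.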
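Multiple linear regression relies on several assumptions, including a linear relationship between the predictors and the response, which the residual analysis later in this post checks.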
Let's perform multiple linear regression on the Graduate Admission dataset and estimate a student's chance of admission from test scores, SOP, research experience, and so on.
Here is the dataset link: Dataset
The dataset was downloaded from Kaggle in CSV format. Load it into R.
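The purpose of the analysis is graduate admission prediction. The dataset has the following features: GRE score, TOEFL score, university rating, SOP (statement of purpose), LOR (letter of recommendation), CGPA, research experience, and chance of admit.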
Load necessary libraries.
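The loading code is not shown here. The sketch below attaches the packages that the later calls appear to rely on. The list is an assumption inferred from the functions used in this post: `%>%`, `select()`, and `glimpse()` from dplyr, `sample.split()` from caTools, `ggpairs()` from GGally, `kable()` from knitr, and `kable_styling()` from kableExtra.

```r
# Packages inferred from the function calls in this post (an assumption)
pkgs <- c("dplyr", "caTools", "GGally", "knitr", "kableExtra")

# Attach each package that is installed; report any that are missing
for (p in pkgs) {
  if (requireNamespace(p, quietly = TRUE)) {
    suppressPackageStartupMessages(library(p, character.only = TRUE))
  } else {
    message("Package not installed: ", p)
  }
}
```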
The dataset consists of 400 observations and 8 features.
# load the data and drop the serial number column
graduate <- read.csv("/Users/subhalaxmirout/DATA 621/Admission_Predict.csv") %>% dplyr::select(-Serial.No.)
# data overview
glimpse(graduate)
## Rows: 400
## Columns: 8
## $ GRE.Score <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 3…
## $ TOEFL.Score <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 1…
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, …
## $ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3…
## $ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4…
## $ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.0…
## $ Research <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ Chance.of.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.5…
Rename the columns.
names(graduate)[1] <- "GRE_Score"
names(graduate)[2] <- "TOEFL_Score"
names(graduate)[3] <- "University_Rating"
names(graduate)[8] <- "Admission_Chance"
Checking for null values.
## [1] GRE_Score TOEFL_Score University_Rating SOP
## [5] LOR CGPA Research Admission_Chance
## <0 rows> (or 0-length row.names)
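The call behind this check is not shown. Output of the shape above, a data frame with zero rows, is what base R's `complete.cases()` gives when no row contains a missing value. Here is a minimal sketch, using a small hypothetical stand-in (`graduate_demo`) rather than the real data:

```r
# Hypothetical stand-in for the `graduate` data frame
graduate_demo <- data.frame(
  GRE_Score = c(337, 324, 316),
  CGPA = c(9.65, 8.87, 8.00),
  Admission_Chance = c(0.92, 0.76, 0.72)
)

# Rows with at least one NA; zero rows means the data is complete
incomplete_rows <- graduate_demo[!complete.cases(graduate_demo), ]
print(incomplete_rows)

# Per-column NA counts are a useful cross-check
print(colSums(is.na(graduate_demo)))
```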
To see the relationships between the variables, draw a pair plot.
library(GGally)
ggpairs(graduate,lower = list(continuous = wrap('points', colour = "blue")),
diag = list(continuous = wrap("barDiag", colour = "red"))
)
The plot above shows a strong positive correlation between the chance of admission and GRE score, TOEFL score, and CGPA.
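The correlations read off the plot can also be computed with base R's `cor()`. The sketch below uses only the first eight rows printed in the data overview above, so the numbers illustrate the call and are not the full-data correlations. The name `graduate_head` is made up for this example.

```r
# First eight rows as printed in the glimpse() overview (partial data)
graduate_head <- data.frame(
  GRE_Score = c(337, 324, 316, 322, 314, 330, 321, 308),
  TOEFL_Score = c(118, 107, 104, 110, 103, 115, 109, 101),
  CGPA = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90),
  Admission_Chance = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68)
)

# Pearson correlation matrix of all numeric columns
corr <- cor(graduate_head)

# Correlation of each variable with the target
print(round(corr[, "Admission_Chance"], 2))
```

On the full dataset, the same call would be `cor(graduate)`, since every column is numeric.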
We split the dataset into two parts, train and test. The train set contains 75% of the data and the test set contains the remaining 25%.
set.seed(2)
# split on the target so that roughly 75% of the rows go to train
is_train <- sample.split(graduate$Admission_Chance, SplitRatio = 0.75)
train <- subset(graduate, is_train == TRUE)
test <- subset(graduate, is_train == FALSE)
print(dim(train))
## [1] 302 8
print(dim(test))
## [1] 98 8
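For comparison, a plain random split can be written in base R without caTools. This is a swapped-in alternative, not the method used above, and `demo` is a hypothetical stand-in data frame. It gives exactly 300 and 100 rows, whereas `sample.split()` preserves the relative ratios of the distinct values in the vector it is given, which is why the split above is 302 and 98.

```r
set.seed(2)
n <- 400                               # number of rows, as in the dataset
demo <- data.frame(id = seq_len(n))    # hypothetical stand-in data frame

# Draw 75% of the row indices at random, without replacement
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

train_demo <- demo[train_idx, , drop = FALSE]
test_demo  <- demo[-train_idx, , drop = FALSE]

print(dim(train_demo))
print(dim(test_demo))
```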
We will build the multiple linear regression model using Chance of Admit (Admission_Chance) as the target variable and include all the independent variables.
##
## Call:
## lm(formula = Admission_Chance ~ ., data = graduate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26259 -0.02103 0.01005 0.03628 0.15928
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2594325 0.1247307 -10.097 < 2e-16 ***
## GRE_Score 0.0017374 0.0005979 2.906 0.00387 **
## TOEFL_Score 0.0029196 0.0010895 2.680 0.00768 **
## University_Rating 0.0057167 0.0047704 1.198 0.23150
## SOP -0.0033052 0.0055616 -0.594 0.55267
## LOR 0.0223531 0.0055415 4.034 6.6e-05 ***
## CGPA 0.1189395 0.0122194 9.734 < 2e-16 ***
## Research 0.0245251 0.0079598 3.081 0.00221 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06378 on 392 degrees of freedom
## Multiple R-squared: 0.8035, Adjusted R-squared: 0.8
## F-statistic: 228.9 on 7 and 392 DF, p-value: < 2.2e-16
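The model shows a low residual standard error (0.064), a high \(R^2\) (0.80), and a very small overall p-value (< 2.2e-16). University_Rating and SOP are not significant at the 0.05 level. To check linearity, we look at the residuals.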
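One common way to run this check is a residuals-versus-fitted plot. The sketch below uses synthetic data (`x` and `y` are made up for illustration) and base graphics. With a fitted model such as `model_mlr`, the same two accessor functions, `fitted()` and `resid()`, would apply.

```r
# Synthetic data with a truly linear relationship (hypothetical)
set.seed(3)
x <- runif(150, 0, 10)
y <- 2 + 0.5 * x + rnorm(150, sd = 0.3)
fit <- lm(y ~ x)

# Residuals vs fitted values: under linearity the points scatter
# evenly around zero with no visible curve or funnel shape
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
```

`plot(fit, which = 1)` draws a similar diagnostic with a smoothed trend line added.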
The residual plot above shows that the model meets the linearity assumption, so we can write the fitted equation:
\[\text{Admission Chance} = -1.26 + 0.0017 \times \text{GRE Score} + 0.003 \times \text{TOEFL Score} + 0.006 \times \text{University Rating} - 0.003 \times \text{SOP} + 0.022 \times \text{LOR} + 0.119 \times \text{CGPA} + 0.025 \times \text{Research}\]
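As a worked example, the equation can be evaluated by hand for the first applicant in the data overview (GRE 337, TOEFL 118, university rating 4, SOP 4.5, LOR 4.5, CGPA 9.65, research 1, reported chance 0.92). The sketch uses the unrounded coefficients from the model summary. The names `coefs` and `applicant` are introduced only for this illustration.

```r
# Coefficients copied from the model summary above
coefs <- c(
  "(Intercept)" = -1.2594325,
  GRE_Score = 0.0017374,
  TOEFL_Score = 0.0029196,
  University_Rating = 0.0057167,
  SOP = -0.0033052,
  LOR = 0.0223531,
  CGPA = 0.1189395,
  Research = 0.0245251
)

# First applicant from the data overview; the intercept term is 1
applicant <- c(
  "(Intercept)" = 1, GRE_Score = 337, TOEFL_Score = 118,
  University_Rating = 4, SOP = 4.5, LOR = 4.5, CGPA = 9.65, Research = 1
)

# Prediction = sum over terms of coefficient * value
pred <- sum(coefs * applicant[names(coefs)])
print(round(pred, 3))
```

The prediction works out to roughly 0.95, close to the reported 0.92. Because GRE scores are in the hundreds, rounding the coefficients too aggressively before multiplying can shift the result noticeably.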
To predict the admission chance, apply this model to the test data. We add a new column named Predict that takes two values: 0 when the predicted chance is below 0.6, meaning admission is less likely, and 1 otherwise, meaning admission is more likely.
Predict <- predict(model_mlr, test)
test$Predict <- ifelse(Predict < 0.6, "0", "1")
kable(test[1:10,]) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
| | GRE_Score | TOEFL_Score | University_Rating | SOP | LOR | CGPA | Research | Admission_Chance | Predict |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.80 | 1 |
| 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 | 1 |
| 6 | 330 | 115 | 5 | 4.5 | 3.0 | 9.34 | 1 | 0.90 | 1 |
| 7 | 321 | 109 | 3 | 3.0 | 4.0 | 8.20 | 1 | 0.75 | 1 |
| 12 | 327 | 111 | 4 | 4.0 | 4.5 | 9.00 | 1 | 0.84 | 1 |
| 15 | 311 | 104 | 3 | 3.5 | 2.0 | 8.20 | 1 | 0.61 | 1 |
| 16 | 314 | 105 | 3 | 3.5 | 2.5 | 8.30 | 0 | 0.54 | 1 |
| 19 | 318 | 110 | 3 | 4.0 | 3.0 | 8.80 | 0 | 0.63 | 1 |
| 26 | 340 | 120 | 5 | 4.5 | 4.5 | 9.60 | 1 | 0.94 | 1 |
| 27 | 322 | 109 | 5 | 4.5 | 3.5 | 8.80 | 0 | 0.76 | 1 |
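The model summary above reports `data = graduate`, so the test rows were also used for fitting. To measure performance on unseen data, one option is to fit on the training rows only and score the held-out rows. The sketch below does this on synthetic data. The data frame `demo`, its coefficients, and the noise level are all made up for illustration, and the 0.6 cut-off mirrors the one used above.

```r
# Synthetic stand-in data (hypothetical values, not the admissions data)
set.seed(2)
n <- 200
cgpa <- runif(n, 7, 10)
gre <- round(290 + 15 * (cgpa - 7) + rnorm(n, sd = 5))
chance <- -0.9 + 0.15 * cgpa + 0.001 * gre + rnorm(n, sd = 0.05)
chance <- pmin(pmax(chance, 0), 1)   # keep values inside [0, 1]
demo <- data.frame(GRE_Score = gre, CGPA = cgpa, Admission_Chance = chance)

# 75/25 split, then fit on the training rows only
train_idx <- sample(seq_len(n), size = 0.75 * n)
train_demo <- demo[train_idx, ]
test_demo <- demo[-train_idx, ]
fit <- lm(Admission_Chance ~ ., data = train_demo)

# Score the held-out rows
pred <- predict(fit, newdata = test_demo)
rmse <- sqrt(mean((test_demo$Admission_Chance - pred)^2))

# Compare thresholded predictions with thresholded actual values
pred_class <- ifelse(pred < 0.6, 0, 1)
actual_class <- ifelse(test_demo$Admission_Chance < 0.6, 0, 1)

print(rmse)
print(table(actual = actual_class, predicted = pred_class))
```

With the real data, the equivalent change would be to pass `data = train` to `lm()` before calling `predict()` on `test`.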
So, from the above analysis we learn