Multiple Linear Regression Analysis

Subhalaxmi Rout

12/20/2020


Multiple Linear Regression

Regression models describe relationships between variables by fitting a line to the observed data. Regression lets us estimate how a dependent variable changes as the independent variable(s) change.

What is multiple linear regression?

Multiple linear regression is the most common form of linear regression analysis. As a predictive analysis, multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables. The independent variables can be continuous or categorical.

What is the use of multiple linear regression?

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable.

Multiple linear regression equation

The multiple linear regression equation is as follows:

\[y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n + \epsilon\]

Where,

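  • y = value of the dependent variable
  • \(\beta_0\) = y-intercept (the value of y when all independent variables are 0)
  • \(\beta_1 X_1\) = the regression coefficient \((\beta_1)\) of the first independent variable \((X_1)\); the terms up to \(\beta_n X_n\) follow the same pattern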
  • \(\epsilon\) = model error
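To make the equation concrete, here is a minimal sketch in R that simulates data from a known equation and checks that lm() recovers the coefficients. The data and the coefficient values (2, 1.5 and -0.8) are made up for illustration and have nothing to do with the admission dataset.

```r
# Simulated data: all numbers here are arbitrary, chosen only for illustration
set.seed(42)
n   <- 500
x1  <- rnorm(n)
x2  <- rnorm(n)
eps <- rnorm(n, sd = 0.5)                 # model error (epsilon)

# y = beta_0 + beta_1 * X1 + beta_2 * X2 + epsilon
y <- 2 + 1.5 * x1 - 0.8 * x2 + eps

# Fit the model and extract the estimated coefficients
fit <- lm(y ~ x1 + x2)
est <- coef(fit)
print(round(est, 2))
```

The estimates should land close to the values used to generate the data, with the remaining gap coming from the error term.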

Multiple linear regression makes the following assumptions:

  • Homoscedasticity (constant variance): the size of the error in prediction doesn’t change significantly across the values of the independent variables
  • Independence of observations: the observations are independent of one another
  • Normality: the regression residuals are normally distributed
  • Linearity: a linear relationship is assumed between the dependent variable and the independent variables
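Some of these assumptions can be checked numerically as well as visually. The sketch below uses simulated data (not the admission dataset) and three rough checks chosen for illustration: a Shapiro-Wilk test for normality, the correlation between absolute residuals and fitted values as a crude homoscedasticity check, and a hand-computed Durbin-Watson statistic for independence. These are not the only possible checks.

```r
# Simulated data that satisfies the assumptions by construction
set.seed(1)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- 1 + 2 * d$x1 + 3 * d$x2 + rnorm(n, sd = 0.3)

fit <- lm(y ~ x1 + x2, data = d)
res <- resid(fit)

# Normality: a large p-value means no evidence against normal residuals
sw <- shapiro.test(res)

# Homoscedasticity (rough check): the spread of the residuals should not
# grow or shrink with the fitted values, so this correlation should be near 0
spread_cor <- cor(abs(res), fitted(fit))

# Independence: Durbin-Watson statistic, close to 2 when consecutive
# residuals are uncorrelated
dw <- sum(diff(res)^2) / sum(res^2)

print(sw$p.value)
print(spread_cor)
print(dw)
```

Linearity is easiest to judge from a plot of residuals against fitted values, which should show no visible trend.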

Let's perform multiple linear regression on the Graduate Admission dataset and estimate a student's chance of admission based on test scores, SOP, research experience, etc.

Here is the dataset link: Dataset

The dataset was downloaded from Kaggle in CSV format and then loaded into R.

The purpose of the analysis is to predict graduate admission. The dataset has the following features:

  • GRE Scores ( out of 340 )
  • TOEFL Scores ( out of 120 )
  • University Rating ( out of 5 )
  • Statement of Purpose and Letter of Recommendation Strength ( out of 5 )
  • Undergraduate GPA ( out of 10 )
  • Research Experience ( either 0 or 1 )
  • Chance of Admit ( ranging from 0 to 1 )

Load necessary libraries.

library(ggplot2)
library(kableExtra)
library(dplyr)
library(caTools)

EDA

The dataset consists of 400 observations and 8 columns once the serial number is dropped.

# load the data and remove the serial no. column
graduate <- read.csv("/Users/subhalaxmirout/DATA 621/Admission_Predict.csv") %>% dplyr::select(-Serial.No.)
# data overview
glimpse(graduate)
## Rows: 400
## Columns: 8
## $ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 3…
## $ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 1…
## $ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, …
## $ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3…
## $ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4…
## $ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.0…
## $ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.5…

Rename the columns.

names(graduate)[1] <- "GRE_Score"
names(graduate)[2] <- "TOEFL_Score" 
names(graduate)[3] <- "University_Rating" 
names(graduate)[8] <- "Admission_Chance"

Check for missing values.

# null value check
graduate[!complete.cases(graduate),]
## [1] GRE_Score         TOEFL_Score       University_Rating SOP              
## [5] LOR               CGPA              Research          Admission_Chance 
## <0 rows> (or 0-length row.names)
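When a dataset does contain missing values, it helps to see how many there are in each column. A minimal sketch on a hypothetical toy data frame (the `toy` name and its values are made up for illustration):

```r
# Hypothetical toy data frame with two missing values
toy <- data.frame(score = c(320, NA, 310), gpa = c(8.5, 9.1, NA))

na_per_column <- colSums(is.na(toy))        # number of NAs in each column
n_incomplete  <- sum(!complete.cases(toy))  # rows with at least one NA

print(na_per_column)
print(n_incomplete)
```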

To see the relationships between the variables, we draw pair plots.

plot(graduate, col="blue")

library(GGally)
ggpairs(graduate,lower = list(continuous = wrap('points', colour = "blue")),
  diag = list(continuous = wrap("barDiag", colour = "red"))
    )

The plot above shows that the chance of admission is strongly positively correlated with GRE score, TOEFL score and CGPA.
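The same relationships can be read as numbers from a correlation matrix. The sketch below uses a small simulated stand-in (the `toy` data frame and the way it is generated are assumptions made for illustration); with the real data the equivalent call would be cor() on the graduate data frame.

```r
# Simulated stand-in for the admission data, for illustration only
set.seed(7)
n    <- 100
cgpa <- runif(n, 7, 10)
toy  <- data.frame(
  CGPA             = cgpa,
  GRE_Score        = round(300 + 10 * (cgpa - 7) + rnorm(n, sd = 5)),
  Admission_Chance = pmin(1, pmax(0, 0.45 + 0.15 * (cgpa - 7) + rnorm(n, sd = 0.05)))
)

# Correlation of every column with the target, strongest first
cors <- cor(toy)[, "Admission_Chance"]
print(round(sort(cors, decreasing = TRUE), 2))
```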

Data Preparation

We split the dataset into two parts, train and test. The train set holds about 75% of the data and the test set the remaining 25%.

set.seed(2)

sample = sample.split(graduate$Admission_Chance, SplitRatio = 0.75)

train = subset(graduate, sample == TRUE)
test = subset(graduate, sample == FALSE)

print(dim(train))
## [1] 302   8
print(dim(test))
## [1] 98  8
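If caTools is not available, a similar split can be sketched in base R with sample(). This is an alternative shown for illustration on a hypothetical stand-in data frame, not the method used above, and it will not pick the same rows as sample.split().

```r
# Hypothetical stand-in with 400 rows, for illustration only
set.seed(2)
toy <- data.frame(id = 1:400, y = runif(400))

# Draw 75% of the row indices at random for training
train_idx <- sample(seq_len(nrow(toy)), size = floor(0.75 * nrow(toy)))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

print(dim(train_toy))
print(dim(test_toy))
```

Because the indices are drawn without replacement, no row ends up in both sets.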

Model building

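We build the multiple linear regression model with Admission_Chance as the target variable and all other columns as independent variables. Note that the call below fits on the full graduate data frame (the 392 residual degrees of freedom in the output correspond to all 400 rows), so the test rows are not truly held out; fitting with data = train would give an honest out-of-sample check.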

model_mlr <- lm(Admission_Chance ~ ., data = graduate)
summary(model_mlr)
## 
## Call:
## lm(formula = Admission_Chance ~ ., data = graduate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26259 -0.02103  0.01005  0.03628  0.15928 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.2594325  0.1247307 -10.097  < 2e-16 ***
## GRE_Score          0.0017374  0.0005979   2.906  0.00387 ** 
## TOEFL_Score        0.0029196  0.0010895   2.680  0.00768 ** 
## University_Rating  0.0057167  0.0047704   1.198  0.23150    
## SOP               -0.0033052  0.0055616  -0.594  0.55267    
## LOR                0.0223531  0.0055415   4.034  6.6e-05 ***
## CGPA               0.1189395  0.0122194   9.734  < 2e-16 ***
## Research           0.0245251  0.0079598   3.081  0.00221 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06378 on 392 degrees of freedom
## Multiple R-squared:  0.8035, Adjusted R-squared:    0.8 
## F-statistic: 228.9 on 7 and 392 DF,  p-value: < 2.2e-16

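The model shows a low residual standard error (0.064), a high \(R^2\) (0.80) and an overall F-test p-value far below 0.05. Not every predictor is individually significant, however: University_Rating (p = 0.23) and SOP (p = 0.55) are not significant at the 5% level. To check the model assumptions, we look at the residuals.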

hist(resid(model_mlr), col = 'steelblue', main = "Residual Distribution", xlab = "Residuals")

plot(model_mlr)

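The residual plots above suggest that the model reasonably meets the linearity assumption. So we can write the fitted linear equation, with coefficients rounded from the summary output: \[\text{Admission Chance} = -1.26 + 0.0017 \times \text{GRE Score} + 0.003 \times \text{TOEFL Score} + 0.006 \times \text{University Rating} - 0.003 \times \text{SOP} + 0.022 \times \text{LOR} + 0.119 \times \text{CGPA} + 0.025 \times \text{Research}\]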
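As a worked example, we can apply the fitted equation by hand to the fourth applicant shown in the glimpse output (GRE 322, TOEFL 110, University Rating 3, SOP 3.5, LOR 2.5, CGPA 8.67, Research 1). The sketch below copies the coefficients from the summary output above; the vector names `beta` and `x` are chosen here for illustration.

```r
# Coefficients copied from the summary(model_mlr) output
beta <- c(
  "(Intercept)"     = -1.2594325,
  GRE_Score         =  0.0017374,
  TOEFL_Score       =  0.0029196,
  University_Rating =  0.0057167,
  SOP               = -0.0033052,
  LOR               =  0.0223531,
  CGPA              =  0.1189395,
  Research          =  0.0245251
)

# Fourth applicant; the leading 1 multiplies the intercept
x <- c(1, 322, 110, 3, 3.5, 2.5, 8.67, 1)

# Sum of coefficient * value over all terms
pred <- sum(beta * x)
print(round(pred, 3))
```

The result is about 0.74, against an observed Admission_Chance of 0.80 for this applicant.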

Prediction on test data

To predict the chance of admission, we apply the model to the test data and store the result in a new column named Predict. It takes two values: "0" when the predicted chance is below 0.6 (admission is less likely) and "1" otherwise (admission is more likely).

Predict <- predict(model_mlr, test)
test$Predict <- ifelse(Predict < 0.6, "0", "1")
kable(test[1:10,]) %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),latex_options="scale_down")
    GRE_Score  TOEFL_Score  University_Rating  SOP  LOR  CGPA  Research  Admission_Chance  Predict
 4        322          110                  3  3.5  2.5  8.67         1              0.80        1
 5        314          103                  2  2.0  3.0  8.21         0              0.65        1
 6        330          115                  5  4.5  3.0  9.34         1              0.90        1
 7        321          109                  3  3.0  4.0  8.20         1              0.75        1
12        327          111                  4  4.0  4.5  9.00         1              0.84        1
15        311          104                  3  3.5  2.0  8.20         1              0.61        1
16        314          105                  3  3.5  2.5  8.30         0              0.54        1
19        318          110                  3  4.0  3.0  8.80         0              0.63        1
26        340          120                  5  4.5  4.5  9.60         1              0.94        1
27        322          109                  5  4.5  3.5  8.80         0              0.76        1
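A table of predictions does not say how accurate they are. One common way to measure this is to fit on the training rows only, then compute the root mean squared error (RMSE) on the test rows and compare the thresholded predictions with the thresholded actual values. The sketch below does this on simulated stand-in data; the `toy` data frame, its coefficients and the reuse of the 0.6 cut-off are assumptions made for illustration.

```r
# Simulated stand-in data, for illustration only
set.seed(3)
n   <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 0.7 + 0.1 * toy$x1 + 0.05 * toy$x2 + rnorm(n, sd = 0.05)

# 300 rows for training, the remaining 100 for testing
idx       <- sample(seq_len(n), size = 300)
train_toy <- toy[idx, ]
test_toy  <- toy[-idx, ]

# Fit on the training rows only, then predict the held-out rows
fit  <- lm(y ~ x1 + x2, data = train_toy)
pred <- predict(fit, newdata = test_toy)

# RMSE on the test rows
rmse <- sqrt(mean((test_toy$y - pred)^2))

# Agreement between thresholded actual and predicted values
confusion <- table(actual = test_toy$y >= 0.6, predicted = pred >= 0.6)

print(rmse)
print(confusion)
```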

Summary

From the above analysis, we learned:

  • Multiple linear regression and use of MLR
  • How to make linear model using R
  • Model prediction on test data