Assignment 3 (10%)

Gopal Narasimhaiah

[DPO & 040703878]

Instructions

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. Review this website for more details on using R Markdown http://rmarkdown.rstudio.com.

Use RStudio for this assignment. Complete the assignment by inserting your R code wherever you see the string “#INSERT YOUR ANSWER HERE”.

When you click the Knit button, a document (PDF, Word, or HTML format) will be generated that includes both the assignment content as well as the output of any embedded R code chunks.

Submit both the rmd and generated output files. Failing to submit both files will be subject to mark deduction.

Sample Question and Solution

Use seq() to create the vector \((2,4,6,\ldots,20)\).

#Insert your code here.
seq(2,20,by = 2)

##  [1]  2  4  6  8 10 12 14 16 18 20

Note:

You will use ‘Admission_Predict.csv’ for Assignment-3. This dataset includes the data of the applicants of an academic program. Each application has a unique serial number, which represents a particular student. The dataset contains several parameters which are considered important during the application for Masters Programs. The parameters included are:

GRE Scores (out of 340)
TOEFL Scores (out of 120)
University Rating (out of 5)
Statement of Purpose (SOP) (out of 5)
Letter of Recommendation (LOR) Strength (out of 5)
Undergraduate GPA (out of 10)
Research Experience (either 0 or 1)
Chance of Admit (ranging from 0 to 1)

Download “Admission_Predict.csv” dataset and load it as ‘data’.

data <- read.csv("Admission_Predict.csv")

Question 1 (30 points in total)

i - Display the first three rows in this dataset. (1 point)

data[1:3,]

##   Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1          1       337         118                 4 4.5 4.5 9.65        1
## 2          2       324         107                 4 4.0 4.5 8.87        1
## 3          3       316         104                 3 3.0 3.5 8.00        1
##   Chance.of.Admit
## 1            0.92
## 2            0.76
## 3            0.72

ii - Display the structure of all variables. (1 point)

str(data)

## 'data.frame':    400 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

iii - Print the descriptive statistics of the admission data to understand the data a little better (min, max, mean, median, 1st and 3rd quartiles). (1 point)

summary(data)

##    Serial.No.      GRE.Score      TOEFL.Score    University.Rating
##  Min.   :  1.0   Min.   :290.0   Min.   : 92.0   Min.   :1.000    
##  1st Qu.:100.8   1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000    
##  Median :200.5   Median :317.0   Median :107.0   Median :3.000    
##  Mean   :200.5   Mean   :316.8   Mean   :107.4   Mean   :3.087    
##  3rd Qu.:300.2   3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000    
##  Max.   :400.0   Max.   :340.0   Max.   :120.0   Max.   :5.000    
##       SOP           LOR             CGPA          Research     
##  Min.   :1.0   Min.   :1.000   Min.   :6.800   Min.   :0.0000  
##  1st Qu.:2.5   1st Qu.:3.000   1st Qu.:8.170   1st Qu.:0.0000  
##  Median :3.5   Median :3.500   Median :8.610   Median :1.0000  
##  Mean   :3.4   Mean   :3.453   Mean   :8.599   Mean   :0.5475  
##  3rd Qu.:4.0   3rd Qu.:4.000   3rd Qu.:9.062   3rd Qu.:1.0000  
##  Max.   :5.0   Max.   :5.000   Max.   :9.920   Max.   :1.0000  
##  Chance.of.Admit 
##  Min.   :0.3400  
##  1st Qu.:0.6400  
##  Median :0.7300  
##  Mean   :0.7244  
##  3rd Qu.:0.8300  
##  Max.   :0.9700

iv - Use a histogram to assess the normality of the ‘Chance.of.Admit’ variable and explain whether it appears normally distributed or not and why? (1 point)

hist(data$Chance.of.Admit)

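# It is skewed to the left: the mean (0.7244) is slightly below the median (0.73), and
# observations pile up toward the upper end of the range (the maximum is 0.97, so none reach 1.0).
# It is not normal because it is not symmetric around its center.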

Create a set of boxplots that shows the distribution of Chance.of.Admit and SOP variables. Use different colors for different SOP scores. (8 points)

colors <- c("red", "tomato", "violet", "yellow", "green", "gold", "blue", "purple", "plum")
boxplot(Chance.of.Admit ~ SOP, data=data, col=colors)

i- Find the covariance between the “GRE.Score” and the “Chance.of.Admit”. (3 points)

cov(data$GRE.Score, data$Chance.of.Admit)

## [1] 1.313271

ii- Find the correlation between the “GRE.Score”, “TOEFL.Score”, “CGPA” and the “Chance.of.Admit”. (3 points)

cor(data$GRE.Score, data$TOEFL.Score)

## [1] 0.8359768

cor(data$CGPA, data$Chance.of.Admit)

## [1] 0.8732891
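The question asks for the correlation of each of the three predictors with Chance.of.Admit, and a small helper can return all three at once. This is only a sketch: the name `cor_with_target` is hypothetical (it is not part of the assignment), and the demonstration uses just the three rows printed by `data[1:3,]`, which are far too few for a meaningful estimate.

```r
# Hypothetical helper: Pearson correlation of each predictor with one target column.
cor_with_target <- function(df, predictors, target) {
  sapply(predictors, function(p) cor(df[[p]], df[[target]]))
}

# First three rows of the dataset, copied from the output of data[1:3,].
# They only demonstrate the call; three points cannot estimate a correlation well.
first3 <- data.frame(
  GRE.Score       = c(337, 324, 316),
  TOEFL.Score     = c(118, 107, 104),
  CGPA            = c(9.65, 8.87, 8.00),
  Chance.of.Admit = c(0.92, 0.76, 0.72)
)

r_demo <- cor_with_target(first3,
                          c("GRE.Score", "TOEFL.Score", "CGPA"),
                          "Chance.of.Admit")
print(r_demo)  # named vector, one correlation per predictor
```

On the full dataset the same call would be `cor_with_target(data, c("GRE.Score", "TOEFL.Score", "CGPA"), "Chance.of.Admit")`.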

iii - Interpret the covariance and correlation results obtained from i and ii in terms of the strength and direction of the relationship. (4 points)

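# The covariance between GRE.Score and Chance.of.Admit (1.313) is positive, so the two
# move in the same direction, but its size depends on the units and does not show strength.
# GRE.Score and TOEFL.Score are strongly positively correlated (r = 0.836),
# and the undergrad GPA is also strongly positively correlated with
# chance of admission (r = 0.873). It makes sense.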
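Covariance carries the units of both variables, while correlation is the covariance divided by the two standard deviations and is therefore unit-free. The sketch below illustrates this on the three rows printed by `data[1:3,]` (too few for a real estimate, used here only to show the arithmetic); the variable names are assumptions made for the example.

```r
# GRE.Score and Chance.of.Admit for the first three applicants (from data[1:3,]).
gre    <- c(337, 324, 316)
chance <- c(0.92, 0.76, 0.72)

# Correlation is covariance rescaled by both standard deviations.
cov_xy <- cov(gre, chance)
r_xy   <- cov_xy / (sd(gre) * sd(chance))

# Expressing the chance in percent multiplies the covariance by 100 ...
chance_pct <- chance * 100
cov_pct    <- cov(gre, chance_pct)

# ... but leaves the correlation unchanged.
r_pct <- cor(gre, chance_pct)

print(c(cov = cov_xy, cov_percent = cov_pct))
print(c(r = r_xy, r_percent = r_pct))
```

This is why the sign of a covariance is informative but its magnitude is not, whereas a correlation can be read directly as strength on a scale from -1 to 1.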

Use ggplot() to plot the graphs to see the relationship between each of three variables (GRE.Score, TOEFL.Score, CGPA) with Chance.of.Admit. (8 points)

library(ggplot2)
ggplot(data, aes(x=GRE.Score, y=Chance.of.Admit)) + geom_point(shape=1)

ggplot(data, aes(x=TOEFL.Score, y=Chance.of.Admit)) + geom_point(shape=1)

ggplot(data, aes(x=CGPA, y=Chance.of.Admit)) + geom_point(shape=1)

Question 2 (40 points in total)

Define the linear regression model between GRE.Score and Chance.of.Admit (3 points)

fit <- lm(Chance.of.Admit ~ GRE.Score, data=data)

i - Plot the regression (least-square) line on the same plot. (3 points)

plot(Chance.of.Admit ~ GRE.Score, data=data)
abline(fit)

ii - Explain the meaning of the slope and y-intercept for the least-squares regression line in (b). (3 points)

# The slope of GRE.Score, 0.0099759, means that the predicted chance of admission rises by
# about 0.01 (1 percentage point) for each additional GRE point, on average.
# The y-intercept, -2.4360842, is the predicted chance of admission at a GRE score of zero.
# A negative chance is impossible, and zero lies far outside the observed GRE range (290 to 340),
# so the intercept only positions the line and has no practical interpretation.
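To see where the fitted line is usable, it can be evaluated at the smallest and largest GRE scores in the data (290 and 340, from `summary(data)`) and at zero. This sketch hard-codes the coefficients printed by `summary(fit)`; the names `b0`, `b1`, and `fitted_line` are assumptions made for the example.

```r
# Coefficients copied from the summary(fit) output (rounded as printed).
b0 <- -2.4360842  # intercept
b1 <-  0.0099759  # slope for GRE.Score

# The least-squares line as a function of the GRE score.
fitted_line <- function(gre) b0 + b1 * gre

# Predictions at the observed minimum and maximum GRE scores.
at_range <- fitted_line(c(290, 340))

# Prediction at a GRE score of zero: this is just the intercept.
at_zero <- fitted_line(0)

print(round(at_range, 3))
print(at_zero)
```

Within the observed range the predictions stay between 0 and 1, while the value at zero is an extrapolation far beyond the data.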

Print the results of this model and interpret the results by following questions:

i - How many observations was the regression run on? (3 points)

summary(fit)

## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33613 -0.04604  0.00408  0.05644  0.18339 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.4360842  0.1178141  -20.68   <2e-16 ***
## GRE.Score    0.0099759  0.0003716   26.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08517 on 398 degrees of freedom
## Multiple R-squared:  0.6442, Adjusted R-squared:  0.6433 
## F-statistic: 720.6 on 1 and 398 DF,  p-value: < 2.2e-16

# 400 observations: 398 residual degrees of freedom plus the 2 estimated
# coefficients (intercept and slope).

ii - Interpret the R-squared of this regression? (4 points)

# R-squared is 0.6442. It means that 64.42% of the total variation in Chance.of.Admit is
# accounted for by GRE.Score alone.
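In a regression with a single predictor, R-squared equals the squared Pearson correlation between the predictor and the response, so the correlation between GRE.Score and Chance.of.Admit can be recovered from the printed summary. A minimal sketch, using the rounded values from `summary(fit)` (the variable names are assumptions):

```r
# Values copied from the summary(fit) output.
r_squared <- 0.6442
slope     <- 0.0099759

# With one predictor, r has the sign of the slope and magnitude sqrt(R-squared).
r_gre_chance <- sign(slope) * sqrt(r_squared)

print(round(r_gre_chance, 2))
```

Because R-squared is printed to four decimals, the recovered correlation (about 0.80) is accurate only to roughly that precision.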

iii - Write the regression equation associated with this regression model? (4 points)

# Chance.of.Admit = -2.4360842 + 0.0099759 * GRE.Score

Use the regression line to predict the chance of admit when the GRE score is 310. (10 points)

predict(fit, newdata = data.frame(GRE.Score=310))

##         1 
## 0.6564392

Drawing on this linear model between GRE.Score and Chance.of.Admit, what GRE score should a student have for a 50% chance of admission? (10 points)

# 0.50 = -2.4360842 + 0.0099759 * GRE.Score 
# GRE.Score = (0.50 + 2.4360842)/0.0099759 
#           = 294.3177
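The same inversion can be done in R rather than by hand. This sketch hard-codes the rounded coefficients from `summary(fit)`; the names `b0`, `b1`, `target`, and `gre_needed` are assumptions made for the example.

```r
# Coefficients copied from the summary(fit) output (rounded as printed).
b0 <- -2.4360842
b1 <-  0.0099759

# Desired predicted chance of admission.
target <- 0.50

# Solve target = b0 + b1 * GRE.Score for GRE.Score.
gre_needed <- (target - b0) / b1

# Round trip: plugging the score back into the line returns the target.
check <- b0 + b1 * gre_needed

print(round(gre_needed, 4))
print(check)
```

The resulting score lies inside the observed GRE range (290 to 340), so the answer is an interpolation rather than an extrapolation.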

Question 3 (30 points in total)

Use three variables (‘GRE.Score’,‘TOEFL.Score’, ‘CGPA’) to build a multiple linear regression model to predict ‘Chance.of.Admit’. Display a summary of your model indicating Residuals, Coefficients, …, etc. What conclusion can you draw from this summary? (8 points)

fit2 <- lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + CGPA, data=data)
summary(fit2)

## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + CGPA, 
##     data = data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.290375 -0.023030  0.008255  0.040153  0.143108 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.5856984  0.1058153 -14.986  < 2e-16 ***
## GRE.Score    0.0022660  0.0005929   3.822 0.000154 ***
## TOEFL.Score  0.0031123  0.0011070   2.812 0.005176 ** 
## CGPA         0.1462844  0.0111770  13.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06632 on 396 degrees of freedom
## Multiple R-squared:  0.7854, Adjusted R-squared:  0.7837 
## F-statistic:   483 on 3 and 396 DF,  p-value: < 2.2e-16

# All three predictors are statistically significant (p < 0.01), and R-squared (0.7854) is
# substantially higher than in the simple regression (0.6442). TOEFL.Score and CGPA therefore
# add considerable explanatory power for Chance.of.Admit; CGPA has by far the largest t value (13.088).

Write the regression equation associated with this multiple regression model? (8 points)

# Chance.of.Admit = -1.5856984 
#                  + 0.0022660 * GRE.Score 
#                  + 0.0031123 * TOEFL.Score 
#                  + 0.1462844 * CGPA

Using this model:

i - Find the chance of admit for the 3rd and 23rd students in the dataset. (4 points)

predict(fit2)[c(3, 23)]

##         3        23 
## 0.6242940 0.9082592
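The fitted value for the 3rd student can be cross-checked by hand, since that student's scores appear in the output of `data[1:3,]` (GRE 316, TOEFL 104, CGPA 8.00). The sketch below uses the rounded coefficients from `summary(fit2)`, so it agrees with `predict()` only to about four decimals; the variable names are assumptions.

```r
# Coefficients copied from the summary(fit2) output (rounded as printed).
coefs <- c(intercept = -1.5856984,
           gre       =  0.0022660,
           toefl     =  0.0031123,
           cgpa      =  0.1462844)

# Predictor values for the 3rd student, with a leading 1 for the intercept.
student3 <- c(1, 316, 104, 8.00)

# The prediction is the dot product of coefficients and predictor values.
manual_pred <- sum(coefs * student3)

print(round(manual_pred, 4))
```

The 23rd student's scores are not shown in the printed rows, so the same check for that student needs the full dataset.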

ii - Identify which one has higher chance than the other and print the difference between the chance of admit of these two students. (3 points)

predict(fit2)[23] - predict(fit2)[3]

##        23 
## 0.2839652

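# The 23rd student has the higher predicted chance (0.9083 versus 0.6243);
# the difference (23rd minus 3rd) is 0.2840.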

Explain the difference between the model in Question 2(b) and the model in Question 3(b) (7 points)

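# The model in Question 2(b) uses a single predictor (GRE.Score) to explain Chance.of.Admit, while
# the model in Question 3(b) uses three (GRE.Score, TOEFL.Score, CGPA), and it explains better:
# R-squared rises from 0.6442 to 0.7854 and the residual standard error falls from 0.08517 to 0.06632.
# The GRE.Score coefficient also shrinks from 0.0099759 to 0.0022660, because in the simple model
# GRE.Score also picks up the effect of correlated predictors such as TOEFL.Score (r = 0.836).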
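Plain R-squared never decreases when predictors are added, so the two models are compared more fairly on adjusted R-squared, which penalises the extra terms. The sketch below recomputes it from the printed R-squared values with n = 400 observations; the helper name `adj_r2` is hypothetical, and because the inputs are rounded the results match the printed 0.6433 and 0.7837 only to about three decimals.

```r
# Hypothetical helper: adjusted R-squared from R-squared, sample size n,
# and number of predictors p (excluding the intercept).
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared values copied from summary(fit) and summary(fit2).
simple   <- adj_r2(0.6442, n = 400, p = 1)  # GRE.Score only
multiple <- adj_r2(0.7854, n = 400, p = 3)  # GRE.Score + TOEFL.Score + CGPA

print(round(c(simple = simple, multiple = multiple), 3))
```

The multiple regression keeps its advantage after the penalty, which supports the conclusion that the two extra predictors genuinely improve the fit.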

CIND 123 - Data Analytics: Basic Methods Winter2021 Assignment 3