This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see http://rmarkdown.rstudio.com.
Use RStudio for this assignment. Complete the assignment by inserting your R code wherever you see the string “#INSERT YOUR ANSWER HERE”.
When you click the Knit button, a document (PDF, Word, or HTML format) will be generated that includes both the assignment content as well as the output of any embedded R code chunks.
Submit both the .Rmd file and the generated output file. Failure to submit both files will result in a mark deduction.
Use seq() to create the vector \((2,4,6,\ldots,20)\).
#Insert your code here.
seq(2,20,by = 2)
## [1] 2 4 6 8 10 12 14 16 18 20
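As a side note, the same vector can be built in a few equivalent ways. The sketch below is illustrative only and is not part of the required answer; the variable names are made up for this example.

```r
# Three ways to build the even numbers from 2 to 20.
v_by     <- seq(2, 20, by = 2)            # step size given explicitly
v_length <- seq(2, 20, length.out = 10)   # number of elements given instead
v_mult   <- 2 * (1:10)                    # arithmetic on an integer sequence

# all.equal() compares values using a numeric tolerance.
print(all.equal(v_by, v_length))
print(all.equal(v_by, v_mult))
```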
You will use ‘Admission_Predict.csv’ for Assignment-3. This dataset contains data on the applicants to an academic program. Each application has a unique serial number, which identifies a particular student. The dataset contains several parameters that are considered important when applying to Masters programs. The parameters included are:
GRE Scores (out of 340)
TOEFL Scores (out of 120)
University Rating (out of 5)
Statement of Purpose (SOP) (out of 5)
Letter of Recommendation (LOR) Strength (out of 5)
Undergraduate GPA (out of 10)
Research Experience (either 0 or 1)
Chance of Admit (ranging from 0 to 1)
Download “Admission_Predict.csv” dataset and load it as ‘data’.
data <- read.csv("Admission_Predict.csv")
data[1:3,]
## Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1 1 337 118 4 4.5 4.5 9.65 1
## 2 2 324 107 4 4.0 4.5 8.87 1
## 3 3 316 104 3 3.0 3.5 8.00 1
## Chance.of.Admit
## 1 0.92
## 2 0.76
## 3 0.72
ii - Display the structure of all variables. (1 point)
str(data)
## 'data.frame': 400 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
iii - Print the descriptive statistics of the admission data to understand the data a little better (min, max, mean, median, 1st and 3rd quartiles). (1 point)
summary(data)
## Serial.No. GRE.Score TOEFL.Score University.Rating
## Min. : 1.0 Min. :290.0 Min. : 92.0 Min. :1.000
## 1st Qu.:100.8 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000
## Median :200.5 Median :317.0 Median :107.0 Median :3.000
## Mean :200.5 Mean :316.8 Mean :107.4 Mean :3.087
## 3rd Qu.:300.2 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000
## Max. :400.0 Max. :340.0 Max. :120.0 Max. :5.000
## SOP LOR CGPA Research
## Min. :1.0 Min. :1.000 Min. :6.800 Min. :0.0000
## 1st Qu.:2.5 1st Qu.:3.000 1st Qu.:8.170 1st Qu.:0.0000
## Median :3.5 Median :3.500 Median :8.610 Median :1.0000
## Mean :3.4 Mean :3.453 Mean :8.599 Mean :0.5475
## 3rd Qu.:4.0 3rd Qu.:4.000 3rd Qu.:9.062 3rd Qu.:1.0000
## Max. :5.0 Max. :5.000 Max. :9.920 Max. :1.0000
## Chance.of.Admit
## Min. :0.3400
## 1st Qu.:0.6400
## Median :0.7300
## Mean :0.7244
## 3rd Qu.:0.8300
## Max. :0.9700
iv - Use a histogram to assess the normality of the ‘Chance.of.Admit’ variable and explain whether it appears normally distributed or not and why? (1 point)
hist(data$Chance.of.Admit)
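# It is skewed to the left: the values are bunched toward the upper end of the range
# (the maximum is 0.97, so nothing actually reaches 1.0), with a longer tail toward low values.
# It does not appear normal because it is not symmetric around its centre; the mean (0.7244)
# sits slightly below the median (0.73), which is consistent with left skew.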
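The visual impression from the histogram can be backed up with a numeric check. The sketch below computes a moment-based sample skewness on only the first ten Chance.of.Admit values shown in the str() output above, so the number will differ from the full-data value; `sample_skewness` is a hypothetical helper written for this example, not a base R function.

```r
# First ten Chance.of.Admit values, copied from the str() output above.
chance <- c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)

# Hypothetical helper: moment-based sample skewness.
# Negative values indicate a longer left tail.
sample_skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew <- sample_skewness(chance)
print(skew)

# A mean below the median points in the same direction.
print(mean(chance) < median(chance))
```

On the full dataset the same function could be applied to data$Chance.of.Admit.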
colors <- c("red", "tomato", "violet", "yellow", "green", "gold", "blue", "purple", "plum")
boxplot(Chance.of.Admit ~ SOP, data=data, col=colors)
i- Find the covariance between the “GRE.Score” and the “Chance.of.Admit”. (3 points)
cov(data$GRE.Score, data$Chance.of.Admit)
## [1] 1.313271
ii- Find the correlation between the “GRE.Score”, “TOEFL.Score”, “CGPA” and the “Chance.of.Admit”. (3 points)
cor(data$GRE.Score, data$TOEFL.Score)
## [1] 0.8359768
cor(data$CGPA, data$Chance.of.Admit)
## [1] 0.8732891
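The question asks about four variables, while the two calls above cover only two of the six possible pairs. Passing several columns to cor() returns the full correlation matrix in one step. The sketch below uses only the first ten rows shown in the str() output (stored in a made-up data frame called `first10`), so its values will not match the full-data results printed above.

```r
# First ten rows of the four variables, copied from the str() output above.
first10 <- data.frame(
  GRE.Score       = c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323),
  TOEFL.Score     = c(118, 107, 104, 110, 103, 115, 109, 101, 102, 108),
  CGPA            = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00, 8.60),
  Chance.of.Admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)
)

# Pairwise Pearson correlations between every pair of columns.
cor_matrix <- cor(first10)
print(round(cor_matrix, 3))

# The column of interest: each predictor's correlation with Chance.of.Admit.
print(cor_matrix[, "Chance.of.Admit"])
```

With the full dataset, the equivalent call would be cor(data[, c("GRE.Score", "TOEFL.Score", "CGPA", "Chance.of.Admit")]).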
iii - Interpret the covariance and correlation results obtained from i and ii in terms of the strength and direction of the relationship. (4 points)
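# The covariance between GRE.Score and Chance.of.Admit (1.313) is positive, so the two tend to
# rise together. Its magnitude depends on the units of both variables, so on its own it says
# nothing about the strength of the relationship.
# GRE.Score and TOEFL.Score are strongly positively correlated (r = 0.836), and the
# undergraduate GPA is strongly positively correlated with chance of admission (r = 0.873).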
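The link between covariance and correlation can be made concrete: the Pearson correlation is the covariance divided by the product of the two standard deviations, which removes the units. A minimal sketch on the first ten rows from the str() output (so the numbers differ from the full-data results):

```r
# First ten GRE.Score and Chance.of.Admit values from the str() output above.
gre    <- c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323)
chance <- c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)

# Covariance carries the units of both variables (GRE points x probability).
cov_xy <- cov(gre, chance)

# Dividing by both standard deviations gives a unit-free value in [-1, 1].
cor_manual  <- cov_xy / (sd(gre) * sd(chance))
cor_builtin <- cor(gre, chance)

print(c(covariance = cov_xy, manual = cor_manual, builtin = cor_builtin))
```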
library(ggplot2)
ggplot(data, aes(x=GRE.Score, y=Chance.of.Admit)) + geom_point(shape=1)
ggplot(data, aes(x=TOEFL.Score, y=Chance.of.Admit)) + geom_point(shape=1)
ggplot(data, aes(x=CGPA, y=Chance.of.Admit)) + geom_point(shape=1)
fit <- lm(Chance.of.Admit ~ GRE.Score, data=data)
i - Plot the regression (least-square) line on the same plot. (3 points)
plot(Chance.of.Admit ~ GRE.Score, data=data)
abline(fit)
ii - Explain the meaning of the slope and y-intercept for the least-squares regression line in (b). (3 points)
# The slope for GRE.Score, 0.0099759, means that the predicted chance of admission rises by
# about 0.01 (one percentage point) for each additional GRE point, on average.
# The y-intercept, -2.4360842, is the predicted chance of admission for an applicant with a
# GRE score of zero. A negative chance is not meaningful: GRE scores in the data range from
# 290 to 340, so the intercept is an extrapolation far outside the observed range.
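One common way to get an interpretable intercept is to centre the predictor at its mean; the slope is unchanged, and the intercept becomes the predicted chance for an applicant with an average GRE score. The sketch below illustrates this on the first ten rows from the str() output only; `first10`, `GRE.Centred`, `fit_raw`, and `fit_centred` are names invented for the example, and the coefficients will differ from the full-data fit.

```r
# First ten rows, copied from the str() output above.
first10 <- data.frame(
  GRE.Score       = c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323),
  Chance.of.Admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)
)

# Original parameterisation: the intercept is the prediction at GRE.Score = 0.
fit_raw <- lm(Chance.of.Admit ~ GRE.Score, data = first10)

# Centred parameterisation: the intercept is the prediction at the mean GRE.Score.
first10$GRE.Centred <- first10$GRE.Score - mean(first10$GRE.Score)
fit_centred <- lm(Chance.of.Admit ~ GRE.Centred, data = first10)

print(coef(fit_raw))
print(coef(fit_centred))
```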
i - On how many observations was the regression run? (3 points)
summary(fit)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33613 -0.04604 0.00408 0.05644 0.18339
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4360842 0.1178141 -20.68 <2e-16 ***
## GRE.Score 0.0099759 0.0003716 26.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08517 on 398 degrees of freedom
## Multiple R-squared: 0.6442, Adjusted R-squared: 0.6433
## F-statistic: 720.6 on 1 and 398 DF, p-value: < 2.2e-16
# 400 observations (398 residual degrees of freedom + 2 estimated coefficients)
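The observation count can also be read from the fitted model object rather than worked out from the summary. A minimal sketch on the first ten rows from the str() output (so the count here is 10, not 400); `fit10` is a name made up for this example.

```r
# First ten rows, copied from the str() output above.
first10 <- data.frame(
  GRE.Score       = c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323),
  Chance.of.Admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)
)
fit10 <- lm(Chance.of.Admit ~ GRE.Score, data = first10)

# Direct count of observations used in the fit.
n_obs <- nobs(fit10)

# Same count recovered from residual degrees of freedom plus estimated coefficients.
n_from_df <- df.residual(fit10) + length(coef(fit10))

print(c(nobs = n_obs, from_df = n_from_df))
```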
ii - Interpret the R-squared of this regression? (4 points)
# R-squared is 0.6442. It means that 64.42% of the total variation in Chance.of.Admit is
# accounted for by GRE.Score alone.
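For a regression with a single predictor, R-squared equals the squared Pearson correlation between the predictor and the response. The sketch below checks this identity on the first ten rows from the str() output; the values differ from the full-data 0.6442, and `fit10` is a name made up for the example.

```r
# First ten rows, copied from the str() output above.
first10 <- data.frame(
  GRE.Score       = c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323),
  Chance.of.Admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)
)
fit10 <- lm(Chance.of.Admit ~ GRE.Score, data = first10)

# R-squared as reported by summary().
r_squared <- summary(fit10)$r.squared

# Squared correlation between predictor and response.
r_from_cor <- cor(first10$GRE.Score, first10$Chance.of.Admit)^2

print(c(r_squared = r_squared, cor_squared = r_from_cor))
```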
iii - Write the regression equation associated with this regression model? (4 points)
# Chance.of.Admit = -2.4360842 + 0.0099759 * GRE.Score
predict(fit, newdata = data.frame(GRE.Score=310))
## 1
## 0.6564392
# 0.50 = -2.4360842 + 0.0099759 * GRE.Score
# GRE.Score = (0.50 + 2.4360842)/0.0099759
# = 294.3177
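The hand calculation above can be reproduced in R by rearranging the regression equation for GRE.Score. This sketch uses the coefficients as printed in the summary (rounded to seven decimals) rather than the fitted object, so it runs without the dataset; `b0`, `b1`, `target`, and `gre_needed` are names chosen for the example.

```r
# Coefficients as printed by summary(fit) above.
b0 <- -2.4360842   # intercept
b1 <- 0.0099759    # slope for GRE.Score

# Target chance of admission.
target <- 0.50

# Solve target = b0 + b1 * GRE.Score for GRE.Score.
gre_needed <- (target - b0) / b1
print(gre_needed)

# Round trip: plugging the result back in should return the target.
print(b0 + b1 * gre_needed)
```

With the fitted model available, coef(fit)[1] and coef(fit)[2] could replace the hard-coded values.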
fit2 <- lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + CGPA, data=data)
summary(fit2)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + CGPA,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.290375 -0.023030 0.008255 0.040153 0.143108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5856984 0.1058153 -14.986 < 2e-16 ***
## GRE.Score 0.0022660 0.0005929 3.822 0.000154 ***
## TOEFL.Score 0.0031123 0.0011070 2.812 0.005176 **
## CGPA 0.1462844 0.0111770 13.088 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06632 on 396 degrees of freedom
## Multiple R-squared: 0.7854, Adjusted R-squared: 0.7837
## F-statistic: 483 on 3 and 396 DF, p-value: < 2.2e-16
# All three predictors are statistically significant at the 1% level, and R-squared (0.7854) is
# substantially higher than in the simple regression (0.6442). TOEFL.Score and CGPA therefore
# add considerable explanatory power for Chance.of.Admit.
# Chance.of.Admit = -1.5856984
# + 0.0022660 * GRE.Score
# + 0.0031123 * TOEFL.Score
# + 0.1462844 * CGPA
i- Find the chance of admit for the 3rd and 23rd students in the dataset. (4 points)
predict(fit2)[c(3, 23)]
## 3 23
## 0.6242940 0.9082592
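As a sanity check, the fitted value for the 3rd student can be reproduced by hand from the regression equation. The sketch below uses the coefficients as printed in summary(fit2) and the 3rd student's scores from the data preview (GRE 316, TOEFL 104, CGPA 8.00). Because the printed coefficients are rounded, the result agrees with the predict() output only to about four decimal places. The 23rd student's scores are not shown in this document, so that case is left out.

```r
# Coefficients as printed by summary(fit2), in the order:
# intercept, GRE.Score, TOEFL.Score, CGPA.
coefs <- c(-1.5856984, 0.0022660, 0.0031123, 0.1462844)

# 3rd student's values from the data preview; the leading 1 multiplies the intercept.
student3 <- c(1, 316, 104, 8.00)

# Fitted value = sum of coefficient * predictor value.
pred3 <- sum(coefs * student3)
print(pred3)
```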
ii- Identify which one has higher chance than the other and print the difference between the chance of admit of these two students. (3 points)
predict(fit2)[23] - predict(fit2)[3]
## 23
## 0.2839652
# The 23rd student has the higher predicted chance; the difference (23rd - 3rd) is about 0.284.
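# The model in Question 2(b) has only one predictor (GRE.Score) and explains 64.42% of the
# variation in Chance.of.Admit, while the model in Question 3(b) has three predictors and
# explains 78.54% (adjusted R-squared 0.7837 versus 0.6433), so it fits the data better.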
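Because the simple model is nested inside the multiple model, the two can also be compared with adjusted R-squared and a nested F-test via anova(). The sketch below only illustrates the mechanics on the first ten rows from the str() output; with so few rows the results are not meaningful and will differ from the full-data comparison. `first10`, `fit_simple`, and `fit_multi` are names made up for the example.

```r
# First ten rows, copied from the str() output above.
first10 <- data.frame(
  GRE.Score       = c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323),
  TOEFL.Score     = c(118, 107, 104, 110, 103, 115, 109, 101, 102, 108),
  CGPA            = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00, 8.60),
  Chance.of.Admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45)
)

fit_simple <- lm(Chance.of.Admit ~ GRE.Score, data = first10)
fit_multi  <- lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + CGPA, data = first10)

# Adjusted R-squared penalises extra predictors, so it is the fairer comparison.
print(c(simple   = summary(fit_simple)$adj.r.squared,
        multiple = summary(fit_multi)$adj.r.squared))

# Nested F-test: do the two extra predictors reduce residual variance significantly?
comparison <- anova(fit_simple, fit_multi)
print(comparison)
```

On the full dataset the same comparison would be anova(fit, fit2).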