Wei Zhang s3759607
Last updated: 01 June, 2019
This dataset was analysed to identify factors that contribute to successful graduate admission, and so to give undergraduate students guidance on how to work towards a graduate program.
Aim
Variables
An independent-samples t-test is used to compare the mean chance of admission between two groups: students with research experience and students without research experience.
Correlation and simple linear regression are used to examine the relationship between undergraduate GPA and chances of admission.
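library(readr)  # provides read_csv()
admission <- read_csv("Admission_Predict_Ver1.1.csv")
# Recode the 0/1 research flag as a labelled factor, with 'Yes' as the first level
admission$Research <- factor(admission$Research, levels = c(1, 0), labels = c('Yes', 'No'))
This dataset is from https://www.kaggle.com/mohansacharya/graduate-admissions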
It’s open source and has a Creative Commons Licence (CC0: Public Domain).
This dataset has a mixture of qualitative and quantitative variables. This analysis uses the following variables.
Undergraduate GPA (CGPA): numeric, out of 10
Research Experience: categorical, 0 or 1, where 0 represents students with no research experience and 1 represents students with research experience
Chance of Admit: numeric, ranging from 0 to 1
There are no missing values in the variables Chance of Admit and CGPA.
A box plot is used to detect outliers in “Chance of Admit”. It flags two outliers with the same value (Chance of Admit = 0.34). This value lies only slightly below the lower fence of the box plot, and the two filtered rows look like genuine cases of students with a low chance of admission rather than data errors. I have therefore decided to keep these values in the analysis.
## [1] 0
## [1] 0
## [1] 0.34 0.34
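The "lower fence" used in that decision can be computed directly. The following is a minimal sketch in base R on made-up values (the `chance` vector is hypothetical and is not taken from the admissions file); it shows how far a flagged value sits below the fence, which is the quantity the argument above relies on.

```r
# Hypothetical admission chances, NOT the values from the dataset
chance <- c(0.05, 0.60, 0.62, 0.65, 0.70, 0.72, 0.75, 0.78, 0.80, 0.82)

# boxplot() works from Tukey's hinges, which fivenum() returns:
# minimum, lower hinge, median, upper hinge, maximum
hinges   <- fivenum(chance)
h_spread <- hinges[4] - hinges[2]

# Fences sit 1.5 hinge-spreads beyond the hinges
lower_fence <- hinges[2] - 1.5 * h_spread
upper_fence <- hinges[4] + 1.5 * h_spread

# Points that boxplot() would draw individually as outliers
flagged <- boxplot.stats(chance, coef = 1.5)$out

# Distance of each flagged value below the lower fence
distance_below <- lower_fence - flagged
```

A small `distance_below` relative to the spread of the data supports keeping the value, while a large one would call for a closer look at the row.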
The table below summarises admission chance by research experience.
admission %>% group_by(Research) %>% summarise(Min = min(`Chance of Admit`, na.rm = TRUE),
Q1 = quantile(`Chance of Admit`, probs = 0.25, na.rm = TRUE),
Median = median(`Chance of Admit`, na.rm=TRUE),
Q3 = quantile(`Chance of Admit`, probs = 0.75, na.rm = TRUE),
Max = max(`Chance of Admit`, na.rm = TRUE),
Mean = mean(`Chance of Admit`, na.rm = TRUE),
SD = sd(`Chance of Admit`, na.rm = TRUE),
n = n(), Missing = sum(is.na(`Chance of Admit`))) -> table1
knitr::kable(table1)

| Research | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Yes | 0.36 | 0.7200 | 0.800 | 0.8925 | 0.97 | 0.7899643 | 0.1232083 | 280 | 0 |
| No | 0.34 | 0.5675 | 0.645 | 0.7100 | 0.89 | 0.6349091 | 0.1119177 | 220 | 0 |
There are three assumptions to be checked:
Independence: the two groups (with and without research experience) are independent of each other.
Normality: according to the qqPlot results, the data points in each group mostly fall within the blue lines. Both groups also have a large sample size (n = 280 and n = 220), so the sampling distribution of the mean can be treated as approximately normal under the Central Limit Theorem.
Homogeneity of variance: this is tested using Levene’s test. The p-value for Levene’s test of equal variance in admission chance between the two groups was p = 0.13 > 0.05. Therefore, we fail to reject $H_0: \sigma_1^2 = \sigma_2^2$, and equal variances are assumed.
admission_research <- admission %>% filter(Research == "Yes")
admission_research$`Chance of Admit` %>% qqPlot(dist = "norm")
## [1] 37 18
admission_noresearch <- admission %>% filter(Research == "No")
admission_noresearch$`Chance of Admit` %>% qqPlot(dist = "norm")
## [1] 37 171
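The Levene’s test call itself is not shown on this slide. As a minimal sketch of the idea in base R, run on simulated stand-in data (the `sim` data frame and its column names are assumptions, not the admissions file): the median-centred form of the test is a one-way ANOVA on each observation's absolute deviation from its own group median. `car::leveneTest()` centres on the median by default, so it should report the same statistic on the same data.

```r
set.seed(1)

# Simulated stand-in data, NOT the Kaggle admissions file
sim <- data.frame(
  research = factor(rep(c("Yes", "No"), times = c(60, 50)),
                    levels = c("Yes", "No")),
  chance   = c(rnorm(60, mean = 0.79, sd = 0.12),
               rnorm(50, mean = 0.63, sd = 0.11))
)

# Absolute deviation of each observation from its own group median
sim$abs_dev <- abs(sim$chance - ave(sim$chance, sim$research, FUN = median))

# One-way ANOVA on those deviations; a small p-value would point to
# unequal spread between the groups
levene   <- anova(lm(abs_dev ~ research, data = sim))
levene_p <- levene[["Pr(>F)"]][1]
```

If `levene_p` is above 0.05, the equal-variance assumption is retained, as it was for the admissions data.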
Hypotheses:
$H_0: \mu_1 - \mu_2 = 0$ (the two groups have the same mean)
$H_A: \mu_1 - \mu_2 > 0$ (the group with research experience has a higher mean than the group without)
According to the two-sample t-test results, t(498) = 14.54 and the one-tailed p-value is below 2.2e-16, well under 0.05, so H0 is rejected. Students with research experience have a statistically significantly higher mean admission chance than students without research experience (0.79 versus 0.63).
##
## Two Sample t-test
##
## data: Chance of Admit by Research
## t = 14.539, df = 498, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.1374803 Inf
## sample estimates:
## mean in group Yes mean in group No
## 0.7899643 0.6349091
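The call that produced this output is not shown. Judging by the output ("Two Sample t-test", "by Research", "greater than 0"), it was probably a pooled-variance, one-sided `t.test()` using the formula interface. A sketch of such a call on simulated stand-in data (the `sim` data frame and its columns are assumptions, not the admissions file):

```r
set.seed(2)

# Simulated stand-in data, NOT the Kaggle admissions file
sim <- data.frame(
  research = factor(rep(c("Yes", "No"), times = c(60, 50)),
                    levels = c("Yes", "No")),
  chance   = c(rnorm(60, mean = 0.79, sd = 0.12),
               rnorm(50, mean = 0.63, sd = 0.11))
)

# With the formula interface the difference is (first level) - (second level),
# so with levels c("Yes", "No") "greater" tests mean(Yes) > mean(No).
# var.equal = TRUE gives the pooled-variance test with n1 + n2 - 2 df.
res <- t.test(chance ~ research, data = sim,
              var.equal = TRUE, alternative = "greater")

res$statistic   # t value
res$parameter   # degrees of freedom
res$p.value     # one-sided p-value
```

The factor level order matters here: with 'No' as the first level, the same call would test the opposite direction.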
A scatter plot is used to visualise the relationship between admission chance and GPA. No point stands apart from the overall pattern, which supports the earlier decision to keep the two low values. The scatter plot also suggests a linear relationship between admission chance and GPA. This will be tested later.
plot(`Chance of Admit` ~ CGPA, data=admission, ylab = "Admission Chance", xlab = "Undergraduate GPA", main = "Admission chance affected by GPA")
model <- lm(`Chance of Admit` ~ CGPA, data=admission)
summary(model)
##
## Call:
## lm(formula = `Chance of Admit` ~ CGPA, data = admission)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.276592 -0.028169 0.006619 0.038483 0.176961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.04434 0.04230 -24.69 <2e-16 ***
## CGPA 0.20592 0.00492 41.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06647 on 498 degrees of freedom
## Multiple R-squared: 0.7787, Adjusted R-squared: 0.7782
## F-statistic: 1752 on 1 and 498 DF, p-value: < 2.2e-16
H0: The data do not fit the linear regression model
HA: The data fit the linear regression model
An F-test is performed. Based on the output, F(1, 498) = 1752 with p-value < 2.2e-16, so H0 is rejected. There is statistically significant evidence of a linear relationship between the two variables.
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -1.04434 | 0.04230 | -24.69 | <2e-16 |
| CGPA | 0.20592 | 0.00492 | 41.85 | <2e-16 |
As per the results above, p < 0.05 for both the intercept and the slope. Therefore, H0 (the coefficient equals zero) is rejected for both, and both the intercept and the slope are significantly different from zero.
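With a single predictor, the overall F-test and the t-test on the slope are the same test: F equals the squared slope t value, and R-squared follows from F and the residual degrees of freedom. A quick arithmetic check, using only figures printed in the summary output (the variable names are illustrative):

```r
t_slope <- 41.85   # t value for CGPA from the coefficient table
f_stat  <- 1752    # reported F-statistic
df_res  <- 498     # residual degrees of freedom

# Squared slope t value, which should be close to the reported F-statistic
t_squared <- t_slope^2

# R-squared implied by F for one predictor: F / (F + residual df)
r_squared <- f_stat / (f_stat + df_res)
```

Both quantities agree with the printed output up to rounding, so the two tests here lead to the same conclusion about CGPA.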
The research design ensured independence among students.
Linearity was checked and confirmed as per the previous slides.
Residuals fall close to the line in the “Normal Q-Q” graph.
The trend line is roughly flat at 0 as per the “Residuals vs Fitted” graph. The red line is close to flat in the “Scale-Location” graph, and the variance in the square root of the standardised residuals is roughly consistent across the fitted values. In “Residuals vs Leverage”, no values fall outside the bands, so there are no influential cases.
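The estimated linear regression model is: Chance of Admit = -1.044 + 0.206*CGPA. Each additional GPA point is associated with an increase of about 0.206 in the predicted chance of admission.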
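As a worked example of using this equation, the sketch below plugs a few GPA values into the fitted coefficients. The helper `predict_chance()` is hypothetical and only reproduces the arithmetic; with the fitted `model` object available, `predict()` would be the usual route.

```r
# Coefficients copied from the summary(model) output
intercept <- -1.04434
slope     <-  0.20592

# Hypothetical helper, not part of the original analysis
predict_chance <- function(cgpa) intercept + slope * cgpa

predict_chance(c(8, 9))   # predicted admission chance for GPAs of 8 and 9
predict_chance(10)        # prediction at the top of the GPA scale
```

A GPA of 9 gives a predicted chance of about 0.81. At a GPA of 10 the line returns a value slightly above 1, a reminder that a straight-line model is not bounded to the 0 to 1 range of the outcome.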
A hypothesis test for the correlation r has the following statistical hypotheses:
$H_0: r = 0$
$H_A: r \neq 0$
R reports the correlation between admission chance and GPA as r = 0.88, with the p-value printed as 0 (that is, p < .001).
H0 is rejected. There is a statistically significant positive correlation between admission chance and GPA.
bivariate<-as.matrix(dplyr::select(admission, `Chance of Admit`,CGPA)) #Create a matrix of the variables to be correlated
rcorr(bivariate, type = "pearson")
## Chance of Admit CGPA
## Chance of Admit 1.00 0.88
## CGPA 0.88 1.00
##
## n= 500
##
##
## P
## Chance of Admit CGPA
## Chance of Admit 0
## CGPA 0
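The printed P of 0 is a rounded value. A rough check of its size can be made from the rounded r and n alone, using the usual t statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2). This is a sketch from the printed figures, not a rerun of `rcorr()`, and the variable names are illustrative.

```r
r <- 0.88   # rounded correlation from the output above
n <- 500    # sample size from the output above

# t statistic for testing whether a Pearson correlation is zero
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value on n - 2 degrees of freedom
p_two_sided <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
```

The t statistic comes out at roughly 41, close to the slope t value in the regression output, and the p-value is far below .001, which is why it prints as 0.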
Conclusion
Advantages and Limitations
Recommendations
Data Sourced from:
https://www.kaggle.com/mohansacharya/graduate-admissions
Mohan S Acharya, Asfia Armaan, Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions. IEEE International Conference on Computational Intelligence in Data Science, 2019.
Data licensed under CC0: Public Domain