Summary (Boxplot)
##    GRE Score      TOEFL Score    University Rating      SOP
##  Min.   :290.0   Min.   : 92.0   Min.   :1.000     Min.   :1.0
##  1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000     1st Qu.:2.5
##  Median :317.0   Median :107.0   Median :3.000     Median :3.5
##  Mean   :316.8   Mean   :107.4   Mean   :3.087     Mean   :3.4
##  3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000     3rd Qu.:4.0
##  Max.   :340.0   Max.   :120.0   Max.   :5.000     Max.   :5.0
##       LOR             CGPA       Chance of Admit
##  Min.   :1.000   Min.   :6.800   Min.   :0.3400
##  1st Qu.:3.000   1st Qu.:8.170   1st Qu.:0.6400
##  Median :3.500   Median :8.610   Median :0.7300
##  Mean   :3.453   Mean   :8.599   Mean   :0.7244
##  3rd Qu.:4.000   3rd Qu.:9.062   3rd Qu.:0.8300
##  Max.   :5.000   Max.   :9.920   Max.   :0.9700
The summary above shows no obvious anomalies in the dataset. For every variable the mean and the median are close to each other, which suggests that the distributions are roughly symmetric and not dominated by extreme values or data-entry errors. The boxplots below make this easier to see.
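adm %>%
  # drop the row identifier and the binary Research flag
  select(-c("Serial No.", "Research")) %>%
  # reshape to long format: one row per (Parameter, Score) pair
  pivot_longer(c("GRE Score", "TOEFL Score", "University Rating", "SOP", "LOR", "CGPA", "Chance of Admit"),
               names_to = "Parameter", values_to = "Score") %>%
  ggplot(aes(x = Parameter, y = Score)) +
  geom_boxplot(fill = "maroon", color = "black") +
  # one panel per variable, each with its own axis range
  facet_wrap(~Parameter, scales = "free") +
  theme_minimal()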
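As a rough numeric companion to the boxplots, the gap between mean and median can be expressed relative to each variable's interquartile range. The sketch below is illustrative only: `mean_median_gap` is a hypothetical helper, and the toy data frame is simulated, not the admissions data.

```r
# Hypothetical helper: for each numeric column, report the mean, the median,
# and their difference scaled by the IQR (values near 0 suggest symmetry).
mean_median_gap <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  data.frame(
    variable = names(num),
    mean     = vapply(num, mean, numeric(1)),
    median   = vapply(num, median, numeric(1)),
    gap_iqr  = vapply(num, function(x) (mean(x) - median(x)) / IQR(x), numeric(1)),
    row.names = NULL
  )
}

# Toy data (simulated, NOT the admissions data): one symmetric column and
# one right-skewed column.
set.seed(1)
toy <- data.frame(symmetric = rnorm(400, mean = 8.6, sd = 0.6),
                  skewed    = rexp(400, rate = 1))
gaps <- mean_median_gap(toy)
print(gaps)
```

On the real data this would presumably be called as `mean_median_gap(adm)`; a `gap_iqr` close to zero for every column would back up the reading of the summary table.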

Data Distributions
Next we check whether the variables are approximately normally distributed. Well-designed test scores tend to be roughly normal in the population, so an approximately normal sample is one sign that the sample is representative. We start by visualizing the variables we want to test.
# Sturges' rule for the number of bins: 1 + 3.322 * log10(n), with n = 400
hist(adm$`GRE Score`, breaks = 1 + 3.322 * log10(400))

hist(adm$`TOEFL Score`, breaks = 1 + 3.322 * log10(400))

hist(adm$CGPA, breaks = 1 + 3.322 * log10(400))

hist(adm$`Chance of Admit`, breaks = 1 + 3.322 * log10(400))
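
The constant 3.322 in the `breaks` argument is 1/log10(2), which is how Sturges' rule, k = 1 + log2(n), is usually written in base 10. The sketch below checks the resulting bin count, assuming n = 400 observations as in the `hist()` calls. `sturges_bins` is a hypothetical helper; `nclass.Sturges()` is the built-in equivalent from `grDevices`.

```r
# Hypothetical helper: Sturges' rule for the number of histogram bins.
# Note that R's log() is the natural logarithm; the rule needs log10 or log2.
sturges_bins <- function(n) ceiling(1 + 3.322 * log10(n))

n <- 400                                  # sample size assumed from the hist() calls
k_manual  <- sturges_bins(n)
k_builtin <- grDevices::nclass.Sturges(numeric(n))   # built-in, uses log2

print(c(manual = k_manual, builtin = k_builtin))
```

`hist()` treats a single number passed to `breaks` as a suggestion only, and its default is already `breaks = "Sturges"`, so the explicit formula mainly serves to document the choice.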

From the histograms above, some of the variables appear to be normally distributed. Next we use the Kolmogorov-Smirnov test, which gives a formal check of whether each variable departs from a normal distribution.
In R we use ks.test(), passing in the values of each variable. Each variable is first standardized with scale(), because ks.test(x, "pnorm") compares the sample against the standard normal distribution (mean 0, standard deviation 1) unless other parameters are supplied. Without scaling, the comparison is not meaningful. Take CGPA as an example: the scores fall roughly in the range 6.5 to 10.0, so the raw values would all sit far from a reference distribution centered at 0, and the test would reject normality regardless of the actual shape of the data.
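To make the role of standardization concrete, the sketch below uses simulated CGPA-like values (an assumption for illustration, not the admissions data). It compares three calls: raw values against the default reference, standardized values against the default reference, and raw values against a normal distribution with the sample mean and standard deviation.

```r
set.seed(42)
# Simulated CGPA-like values (NOT the admissions data)
x <- rnorm(400, mean = 8.6, sd = 0.6)

# Raw values against the default reference N(0, 1): every observation sits far
# in the right tail of the reference, so the statistic D is close to 1.
raw <- ks.test(x, "pnorm")

# Standardized values against N(0, 1) ...
scaled <- ks.test(as.numeric(scale(x)), "pnorm")

# ... give the same statistic as raw values against N(mean(x), sd(x)).
param <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

print(c(raw    = unname(raw$statistic),
        scaled = unname(scaled$statistic),
        param  = unname(param$statistic)))
```

Either of the last two forms works. Because the mean and standard deviation are estimated from the same sample being tested, the reported p-values should be read as approximate.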
ks.test(scale(adm$`GRE Score`), "pnorm")
## Warning in ks.test(scale(adm$`GRE Score`), "pnorm"): ties should not be present
## for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: scale(adm$`GRE Score`)
## D = 0.050095, p-value = 0.268
## alternative hypothesis: two-sided
ks.test(scale(adm$`TOEFL Score`), "pnorm")
## Warning in ks.test(scale(adm$`TOEFL Score`), "pnorm"): ties should not be
## present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: scale(adm$`TOEFL Score`)
## D = 0.057709, p-value = 0.1392
## alternative hypothesis: two-sided
ks.test(scale(adm$`University Rating`), "pnorm")
## Warning in ks.test(scale(adm$`University Rating`), "pnorm"): ties should not be
## present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: scale(adm$`University Rating`)
## D = 0.19549, p-value = 0.0000000000001055
## alternative hypothesis: two-sided
ks.test(scale(adm$`SOP`), "pnorm")
## Warning in ks.test(scale(adm$SOP), "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: scale(adm$SOP)
## D = 0.12438, p-value = 0.000008432
## alternative hypothesis: two-sided
ks.test(scale(adm$`LOR`), "pnorm")
## Warning in ks.test(scale(adm$LOR), "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: scale(adm$LOR)
## D = 0.12136, p-value = 0.00001528
## alternative hypothesis: two-sided
ks.test(scale(adm$`CGPA`), "pnorm")
## Warning in ks.test(scale(adm$CGPA), "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: scale(adm$CGPA)
## D = 0.044395, p-value = 0.4097
## alternative hypothesis: two-sided
ks.test(scale(adm$`Chance of Admit`), "pnorm")
## Warning in ks.test(scale(adm$`Chance of Admit`), "pnorm"): ties should not be
## present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: scale(adm$`Chance of Admit`)
## D = 0.049712, p-value = 0.2763
## alternative hypothesis: two-sided
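At the 5% significance level, the tests above reject normality for University Rating, SOP, and LOR (p-value < 0.05), while GRE Score, TOEFL Score, CGPA, and Chance of Admit show no significant departure from normality (p-value > 0.05). The result for the first three is not a problem: they are ratings on a 1 to 5 scale with only a few distinct values, so they are not expected to follow a normal distribution. We therefore conclude that all of the variables can be used.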
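The seven separate calls above can also be condensed into one table of statistics and p-values. This is a minimal sketch: `ks_normality` is a hypothetical helper, the 0.05 threshold is an assumed significance level, and the toy data frame is simulated, not the admissions data.

```r
# Hypothetical helper: run the one-sample K-S test against N(0, 1) on every
# standardized numeric column and collect the results in one table.
ks_normality <- function(df, alpha = 0.05) {
  num <- df[vapply(df, is.numeric, logical(1))]
  res <- lapply(num, function(x) {
    # ties in discrete data trigger a warning; suppressed here for brevity
    suppressWarnings(ks.test(as.numeric(scale(x)), "pnorm"))
  })
  out <- data.frame(
    variable = names(num),
    D        = vapply(res, function(r) unname(r$statistic), numeric(1)),
    p_value  = vapply(res, function(r) r$p.value, numeric(1)),
    row.names = NULL
  )
  out$reject_normality <- out$p_value < alpha
  out
}

# Toy data (simulated): a continuous score and a discrete 1-to-5 rating
set.seed(7)
toy <- data.frame(
  continuous_score = rnorm(400, mean = 316, sd = 11),
  rating_1_to_5    = sample(1:5, 400, replace = TRUE)
)
result <- ks_normality(toy)
print(result)
```

The discrete rating column is rejected for the same reason as University Rating, SOP, and LOR: an empirical distribution that jumps between five values cannot track a continuous normal curve closely.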