Pursuing higher study in the USA is a dream for many enthusiastic students. However, it is quite tough to get accepted, as there are several factors to consider. In this project, I explore the importance of the different factors that may raise the probability of admission, and I present some visualizations. The work is based on the US Admission dataset.
- Serial.No.
- TOEFL.Score (out of 120)
- University.Rating (out of 5)
- GRE.Score (out of 340)
- SOP
- LOR
- CGPA
- Research
- Chance.of.Admit (target variable)
library(dplyr)
print('Dataset Summary')
[1] "Dataset Summary"
df = read.csv('US Admission.csv')
glimpse(df)
Rows: 400
Columns: 9
$ Serial.No. <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,~
$ GRE.Score <int> 337, 324, 316, 322, 314, 330, 321, 308, 302, 323, 325, 327, 328, 307, 311, 314, 317, 319, 318, 303, 312, 3~
$ TOEFL.Score <int> 118, 107, 104, 110, 103, 115, 109, 101, 102, 108, 106, 111, 112, 109, 104, 105, 107, 106, 110, 102, 107, 1~
$ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 2, 1, 2, 2, 3, 4, 5, 5, 5~
$ SOP <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0, 3.5, 3.5, 4.0, 4.0, 4.0, 3.5, 3.5, 4.0, 4.0, 4.0, 3.5, 3.0, 3~
$ LOR <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5, 3.0, 4.0, 4.5, 4.5, 3.0, 2.0, 2.5, 3.0, 3.0, 3.0, 3.0, 2.0, 2~
$ CGPA <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90, 8.00, 8.60, 8.40, 9.00, 9.10, 8.00, 8.20, 8.30, 8.70, 8.00~
$ Research <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1~
$ Chance.of.Admit <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68, 0.50, 0.45, 0.52, 0.84, 0.78, 0.62, 0.61, 0.54, 0.66, 0.65~
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This histogram shows the distribution of Chance.of.Admit over the whole dataset. On the positive side, there are no outliers.
library(ggplot2)
ggplot(df,aes(x=Chance.of.Admit,y=TOEFL.Score))+
geom_point()+
theme_minimal()
This is a simple scatter plot of TOEFL.Score against Chance.of.Admit.
library(ggplot2)
gg=ggplot(df, aes(x=cut(Chance.of.Admit, breaks=c(0.0,0.5,1)), y=GRE.Score))+
geom_boxplot(fill=c('red','green'),alpha=0.5)
gg
This box plot shows that students who scored above about 310 on the GRE mostly fall in the higher-chance group (0.5, 1], whose median GRE score is approximately 318. In contrast, students who scored below 310 tend to have a lower chance of admission.
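The medians read off the box plot can also be checked numerically. The following is a minimal sketch on a small made-up data frame; the `toy` values are hypothetical and not taken from the dataset, and with the real data one would use `df` in place of `toy`.

```r
# Hypothetical toy data standing in for df (values are made up for illustration)
toy <- data.frame(
  GRE.Score       = c(300, 305, 298, 318, 325, 312, 330),
  Chance.of.Admit = c(0.42, 0.48, 0.45, 0.71, 0.85, 0.64, 0.93)
)

# Same grouping as in the box plot: (0, 0.5] and (0.5, 1]
toy$group <- cut(toy$Chance.of.Admit, breaks = c(0, 0.5, 1))

# Median GRE score within each group
group_medians <- tapply(toy$GRE.Score, toy$group, median)
print(group_medians)
```

Comparing the two medians gives a quick numeric counterpart to the visual gap between the boxes.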
library(plotly)
library(ggplot2)
# 'colors' (not 'colours') sets the colour scale for the variable mapped to 'color'; the mode must be 'markers'
plot_ly(data=df, x=~Chance.of.Admit, y=~GRE.Score, z=~CGPA, color=~Chance.of.Admit, colors=c('red','green','skyblue'), type='scatter3d', mode='markers')
This 3D scatter plot shows the chance of admission against two important features, CGPA and GRE.Score.
Warning in RColorBrewer::brewer.pal(N, "Set2") :
minimal value for n is 3, returning requested palette with 3 different levels
The highest GRE score in the dataset is 340. A violin plot of GRE score across the five university rating levels shows that the maximum GRE score for ratings 4 and 5 is 340, with a median of 325 for rating 4 and 330 for rating 5. In contrast, for rating 1 the maximum GRE score is 318 and the median is 300. So a good GRE score alone does not appear to determine the probability of admission.
max(df$GRE.Score)
plot_ly(data=df, x=~Chance.of.Admit, y=~GRE.Score, type='box')
Explanation: Here, the target variable is not categorical; this dataset is really suited to a regression problem. This interactive box plot therefore shows many boxes, one for each distinct value in the Chance.of.Admit column. The Chance.of.Admit column denotes the probability of admission.
plot_ly(data=df, x=~CGPA, type='violin')
# The bar colour goes in 'marker'; the title is a layout attribute, not a trace attribute
plot_ly(data=df, x=~TOEFL.Score, type='histogram', marker=list(color='pink')) %>%
layout(title='histogram plot')
library(plotly)
library(ggplot2)
#(data=df, aes(x=Scope.of.Chance, y=GRE.Score))+
ggplotly(gg)
correlation_matrix = cor(df[,c(2:9)])
correlation_matrix
                  GRE.Score TOEFL.Score University.Rating       SOP       LOR      CGPA  Research Chance.of.Admit
GRE.Score         1.0000000   0.8359768         0.6689759 0.6128307 0.5575545 0.8330605 0.5803906       0.8026105
TOEFL.Score       0.8359768   1.0000000         0.6955898 0.6579805 0.5677209 0.8284174 0.4898579       0.7915940
University.Rating 0.6689759   0.6955898         1.0000000 0.7345228 0.6601235 0.7464787 0.4477825       0.7112503
SOP               0.6128307   0.6579805         0.7345228 1.0000000 0.7295925 0.7181440 0.4440288       0.6757319
LOR               0.5575545   0.5677209         0.6601235 0.7295925 1.0000000 0.6702113 0.3968593       0.6698888
CGPA              0.8330605   0.8284174         0.7464787 0.7181440 0.6702113 1.0000000 0.5216542       0.8732891
Research          0.5803906   0.4898579         0.4477825 0.4440288 0.3968593 0.5216542 1.0000000       0.5532021
Chance.of.Admit   0.8026105   0.7915940         0.7112503 0.6757319 0.6698888 0.8732891 0.5532021       1.0000000
The serial number has no bearing on admission, so I simply leave this feature out of the correlation matrix. All remaining features correlate positively and fairly strongly with the chance of admission. The strongest correlations with Chance.of.Admit come from CGPA, GRE.Score and TOEFL.Score, with coefficients of 0.87, 0.80 and 0.79 respectively (these are correlation coefficients, not percentages).
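To make the ranking explicit, the Chance.of.Admit column of the correlation matrix can be sorted. The sketch below copies the values from the matrix printed above; the variable names `target_cor` and `ranked` are only illustrative.

```r
# Correlations with Chance.of.Admit, copied from the matrix above
target_cor <- c(
  GRE.Score = 0.8026105, TOEFL.Score = 0.7915940, University.Rating = 0.7112503,
  SOP = 0.6757319, LOR = 0.6698888, CGPA = 0.8732891, Research = 0.5532021
)

# Rank the features from strongest to weakest correlation with the target
ranked <- sort(target_cor, decreasing = TRUE)
print(round(ranked, 2))

# With the full data frame, the same column could be taken directly from
# correlation_matrix[, "Chance.of.Admit"] (which also contains the target itself, at 1.0)
```

CGPA comes first, followed by GRE.Score and TOEFL.Score, while Research has the weakest correlation with the target.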
library(stats)
df_pca = prcomp(df[, 2:9], center=TRUE, scale.=TRUE)
summary(df_pca)
Importance of components:
                          PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8
Standard deviation     2.3819 0.85340 0.74670 0.57054 0.50083 0.44511 0.38897 0.33904
Proportion of Variance 0.7092 0.09104 0.06969 0.04069 0.03135 0.02477 0.01891 0.01437
Cumulative Proportion  0.7092 0.80022 0.86991 0.91060 0.94195 0.96672 0.98563 1.00000
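The proportions in this summary follow directly from the standard deviations: each component's variance is its squared standard deviation, divided by the total. A short sketch, using the values copied from the summary above (the names `sdev`, `prop_var` and `cum_prop` are only illustrative):

```r
# Standard deviations of the principal components, copied from summary(df_pca)
sdev <- c(2.3819, 0.85340, 0.74670, 0.57054, 0.50083, 0.44511, 0.38897, 0.33904)

# Each component's variance is its squared standard deviation
variance <- sdev^2
prop_var <- variance / sum(variance)   # proportion of variance per component
cum_prop <- cumsum(prop_var)           # running total

print(round(prop_var, 4))
print(round(cum_prop, 4))
```

The first two components together hold about 80% of the variance, which is the reason for keeping only PC1 and PC2 in the next step.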
pc_score=as.data.frame(df_pca$x[ , c(1,2)])
pc_score
new_dataFrame=cbind(pc_score,Chance.of.Admit=df$Chance.of.Admit)
new_dataFrame
library(factoextra)
fviz_eig(df_pca)
fviz_pca_var(df_pca,col.var = 'contrib',gradient.cols=c('red','black','green'))
# divide data for classification
breaks = c(0,0.5,1)
labels = c('Low Chance','High Chance')
Scope = cut(df$Chance.of.Admit, breaks = breaks, labels = labels)
#df
#Scope.of.Chance
df=cbind(df,Scope)
df
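Before training a classifier on the new Scope column, it can help to check how cut() treats the boundary value and how balanced the two classes are. A minimal sketch with made-up values (the `chance` vector is hypothetical, not taken from the dataset):

```r
# Hypothetical admission chances (made up for illustration)
chance <- c(0.34, 0.50, 0.51, 0.72, 0.80, 0.91, 0.65, 0.46)

# Intervals are right-closed by default: (0, 0.5] and (0.5, 1],
# so a chance of exactly 0.50 is labelled 'Low Chance'
scope <- cut(chance, breaks = c(0, 0.5, 1), labels = c("Low Chance", "High Chance"))

# Count and share of each class
class_counts <- table(scope)
print(class_counts)
print(prop.table(class_counts))
```

If one class is much rarer than the other, plain accuracy can look high even when the rare class is predicted poorly, so the per-class shares are worth knowing in advance.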
library(ggplot2)
library(e1071)
library(caret)
library(lattice)
train_idx=createDataPartition(df$Scope, p=0.8, list=FALSE)
train_data = df[train_idx, ]
test_data = df[-train_idx, ]
svm_model = svm(as.factor(Scope) ~ GRE.Score+TOEFL.Score+University.Rating+SOP+LOR+CGPA+Research, data = train_data, kernel = 'linear' )
pred = predict(svm_model, test_data)
pred
3 14 22 33 34 38 41 43 46 47 50
High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance
51 58 67 76 84 89 92 116 118 119 126
High Chance Low Chance High Chance High Chance High Chance High Chance Low Chance High Chance Low Chance Low Chance High Chance
134 136 140 142 147 151 154 157 158 161 163
High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance
164 169 177 191 197 199 200 202 204 206 214
High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance
216 222 223 231 232 233 238 257 260 263 264
High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance
269 275 280 282 284 286 287 293 297 302 305
High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance
307 320 321 326 329 334 364 366 387 389 390
High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance High Chance Low Chance High Chance
391 394 395
High Chance High Chance High Chance
Levels: Low Chance High Chance
conf_mat = confusionMatrix(pred, as.factor(test_data$Scope))
#conf_mat = confusionMatrix(pred, as.factor(test_data$variety))
conf_mat
Confusion Matrix and Statistics
             Reference
Prediction    Low Chance High Chance
  Low Chance           5           0
  High Chance          2          73
Accuracy : 0.975
95% CI : (0.9126, 0.997)
No Information Rate : 0.9125
P-Value [Acc > NIR] : 0.02485
Kappa : 0.8202
Mcnemar's Test P-Value : 0.47950
Sensitivity : 0.7143
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.9733
Prevalence : 0.0875
Detection Rate : 0.0625
Detection Prevalence : 0.0625
Balanced Accuracy : 0.8571
'Positive' Class : Low Chance
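The headline statistics can be reproduced by hand from the four counts in the confusion matrix, which makes it clearer what each one measures. The sketch below copies the counts from the output above and treats Low Chance as the positive class, as caret does here; the object name `cm` is only illustrative.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(5, 2, 0, 73), nrow = 2,
             dimnames = list(Prediction = c("Low Chance", "High Chance"),
                             Reference  = c("Low Chance", "High Chance")))

# Share of all test cases that were classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of true 'Low Chance' cases that were predicted as 'Low Chance'
sensitivity <- cm["Low Chance", "Low Chance"] / sum(cm[, "Low Chance"])

# Share of true 'High Chance' cases that were predicted as 'High Chance'
specificity <- cm["High Chance", "High Chance"] / sum(cm[, "High Chance"])

balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced_accuracy), 4))
```

Only 7 of the 80 test cases are Low Chance, so the 0.975 accuracy is close to the no-information rate of 0.9125; the sensitivity of 0.7143 shows that two of those seven cases were missed.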
#check
confusion_matrix = as.data.frame(conf_mat$table)
confusion_matrix
test_data
predict(svm_model, test_data[1, -9:-10])
3
High Chance
Levels: Low Chance High Chance
ggplot(confusion_matrix, aes(x=Prediction,y=Reference,fill=Freq))+
geom_tile(fill='pink',color='blue')+
geom_text(aes(label=Freq))
ggplot(confusion_matrix, aes(x=Prediction,y=Reference,fill=Freq))+
geom_tile()+
geom_text(aes(label=Freq))
ggplot(df,aes(x=Chance.of.Admit, y=GRE.Score,color=Chance.of.Admit))+
geom_point()+
geom_smooth(method='lm',se=TRUE, level=0.95,color='red') #se= standard error
`geom_smooth()` using formula = 'y ~ x'
linear_model_1=lm(Chance.of.Admit ~ GRE.Score, data=df)
summary(linear_model_1)
Call:
lm(formula = Chance.of.Admit ~ GRE.Score, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.33613 -0.04604 0.00408 0.05644 0.18339
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.4360842  0.1178141  -20.68   <2e-16 ***
GRE.Score    0.0099759  0.0003716   26.84   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08517 on 398 degrees of freedom
Multiple R-squared: 0.6442, Adjusted R-squared: 0.6433
F-statistic: 720.6 on 1 and 398 DF, p-value: < 2.2e-16
This first linear regression model uses GRE.Score as the only predictor. It explains about 64% of the variance in Chance.of.Admit (multiple R-squared 0.6442), so it already predicts most of the data points reasonably well.
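The fitted coefficients can be applied by hand to see what the model predicts. The sketch below uses the estimates from the summary above; the GRE score of 320 is a hypothetical applicant chosen only for illustration.

```r
# Coefficients copied from summary(linear_model_1)
intercept <- -2.4360842
slope     <- 0.0099759

# Hypothetical applicant with a GRE score of 320
gre <- 320
predicted_chance <- intercept + slope * gre
print(round(predicted_chance, 3))

# Change in predicted chance for every 10 additional GRE points
print(round(10 * slope, 3))

# With the fitted model object, the same prediction would come from
# predict(linear_model_1, newdata = data.frame(GRE.Score = 320))
```

The negative intercept has no practical meaning on its own, since a GRE score of 0 lies far outside the observed range.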
ggplot(df, aes(x=Chance.of.Admit, y=GRE.Score, color=Chance.of.Admit))+
geom_point()+
geom_smooth(method = 'lm',formula = y~poly(x,10), se=TRUE, color='purple', level=0.95)
linear_model_2=lm(Chance.of.Admit ~ GRE.Score+CGPA+TOEFL.Score+SOP+LOR, data=df)
summary(linear_model_2)
Call:
lm(formula = Chance.of.Admit ~ GRE.Score + CGPA + TOEFL.Score +
SOP + LOR, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.279782 -0.022660 0.009455 0.036321 0.160459
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.4620860  0.1089705 -13.417  < 2e-16 ***
GRE.Score    0.0023187  0.0005773   4.016 7.08e-05 ***
CGPA         0.1227238  0.0121463  10.104  < 2e-16 ***
TOEFL.Score  0.0029182  0.0010929   2.670  0.00789 ** 
SOP          0.0002035  0.0053371   0.038  0.96960    
LOR          0.0238696  0.0055322   4.315 2.02e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06452 on 394 degrees of freedom
Multiple R-squared: 0.7979, Adjusted R-squared: 0.7953
F-statistic: 311.1 on 5 and 394 DF, p-value: < 2.2e-16
Observation: The adjusted R-squared of the first model, where GRE.Score is the only predictor, is 0.6433. In contrast, the adjusted R-squared of the second model, which uses GRE.Score, TOEFL.Score, CGPA, SOP and LOR as predictors, is 0.7953. The second model is therefore the better of the two in terms of adjusted R-squared. Within the second model, SOP is the only predictor that is not statistically significant (p = 0.97).
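Adjusted R-squared is used for this comparison because it penalises extra predictors, whereas plain R-squared can only go up when predictors are added. The sketch below recomputes both adjusted values from the multiple R-squared figures reported above, with n = 400 rows and p predictors; the helper name `adjusted_r2` is hypothetical.

```r
# Adjusted R-squared from plain R-squared, sample size n and number of predictors p
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values copied from the two model summaries (n = 400 observations)
adj_1 <- adjusted_r2(0.6442, n = 400, p = 1)  # GRE.Score only
adj_2 <- adjusted_r2(0.7979, n = 400, p = 5)  # five predictors

print(round(c(model_1 = adj_1, model_2 = adj_2), 4))
```

Both results match the summaries (0.6433 and 0.7953), and the penalty is small here because 400 observations are many relative to five predictors.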
hist(linear_model_1$residuals)
library(stats)
k_means_cluster=kmeans(df[, 2:8], centers=2)
k_means_cluster
K-means clustering with 2 clusters of sizes 211, 189
Cluster means:
GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
1 307.8626 103.1564 2.417062 2.881517 3.047393 8.196209 0.2701422
2 326.7937 112.1587 3.835979 3.978836 3.904762 9.048519 0.8571429
Clustering vector:
[1] 2 2 1 2 1 2 2 1 1 2 2 2 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2 1 1 1 1 1 1 1 1 2
[66] 2 2 1 2 2 2 2 2 1 1 2 2 1 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 2 1 2 2 2 1 1 2 1 2 1 1 1 1 1 2 2 2 1 1 1 1 2 2 2 2
[131] 2 1 1 2 2 1 1 1 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 1 1 1 1 1 1 1 2 1 2 2 1 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 2 1 2 2 2 2 2 2 2 1
[196] 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 1 1 1 2 1 1 1 1 1 2 2 1 2 1 1 2 2 2 2 1 1 1 1 2 2 1 2 1 1 2 2 2 1 1 2 2 1 1 2 2 2
[261] 2 1 1 2 2 1 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 2 2 2 1 2 2 2 1 1 2 2 1 1 1 1 1 1 2 2 1 2 1 1 1
[326] 2 1 1 2 1 2 1 1 2 1 2 2 2 2 2 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1 1 2 2 2 2 1 1 2 2 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 2 1 2 2 1 1 1 2
[391] 1 1 2 1 2 2 2 2 1 2
Within cluster sum of squares by cluster:
[1] 13872.62 10531.11
(between_SS / total_SS = 64.5 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
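K-means splits the dataset into two clusters. Judging by the cluster means, cluster 2 (higher values on every feature) corresponds to High Chance and cluster 1 to Low Chance.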
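The clustering above runs on the raw columns, where GRE.Score has by far the largest numeric range and so dominates the distances. One possible variation, not what was done above, is to standardise the features first. The sketch below shows this on a small made-up data frame; the `toy` values, the seed and the `nstart` setting are all assumptions for illustration.

```r
set.seed(42)  # arbitrary seed, only to make the illustration reproducible

# Hypothetical toy features on very different scales (values are made up)
toy <- data.frame(
  GRE.Score = c(300, 304, 298, 310, 328, 332, 325, 335),
  CGPA      = c(7.9, 8.1, 7.8, 8.2, 9.2, 9.4, 9.1, 9.6)
)

# Standardise each column so that GRE.Score does not dominate the distances
toy_scaled <- scale(toy)

# nstart > 1 reruns k-means from several random starts and keeps the best fit
fit <- kmeans(toy_scaled, centers = 2, nstart = 10)

print(fit$size)
# Cluster means on the original scale, for easier interpretation
print(aggregate(toy, by = list(cluster = fit$cluster), FUN = mean))
```

Because k-means starts from random centres, the cluster numbering (1 or 2) can differ between runs; fixing a seed keeps the labels stable.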
library(cluster)
# Plot only the columns that were used for clustering
clusplot(df[, 2:8], k_means_cluster$cluster)
After exploring the US Admission dataset, we find that no single feature dominates the probability of admission. GRE.Score, CGPA and TOEFL.Score are of similar importance, and LOR and SOP remain relevant. In this project, we first drew some basic 2D plots and then interactive 3D plots for further clarity. We also investigated the correlation among the features and computed the correlation matrix. Finally, we applied a Support Vector Machine, two linear regression models and k-means clustering.