In this project we will analyze a data set containing GRE scores and GPA grades and try to predict whether a student is admitted to graduate school or not. First we will carry out some EDA in the form of basic descriptive statistics and visualisations, and then we will implement a k-means clustering algorithm and a support vector machine (SVM). This way we can address a question that is on every student’s mind when applying to graduate school!
The first step is getting the data. Luckily for us it is already in csv format and has been cleaned (no NAs here!).
df <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# also let's check if there are any NA's (just in case)
any(is.na(df))
## [1] FALSE
OK, there are no missing values, so we can proceed with loading the required dependencies.
library(aod)
library(ggplot2)
library(caret)
library(gridExtra)
Some initial data exploration.
str(df); dim(df); head(df)
## 'data.frame': 400 obs. of 4 variables:
## $ admit: int 0 1 1 1 0 1 1 0 1 0 ...
## $ gre : int 380 660 800 640 520 760 560 400 540 700 ...
## $ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
## $ rank : int 3 3 1 4 4 2 1 2 3 2 ...
## [1] 400 4
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
So we have data on 400 different students. For every student we know whether they were admitted or not (1 or 0), their gre score (the entrance exam), their gpa (grade point average, 4 highest and 0 lowest) and the rank of their undergraduate institution.
summary(df[,2:3])
## gre gpa
## Min. :220.0 Min. :2.260
## 1st Qu.:520.0 1st Qu.:3.130
## Median :580.0 Median :3.395
## Mean :587.7 Mean :3.390
## 3rd Qu.:660.0 3rd Qu.:3.670
## Max. :800.0 Max. :4.000
The summary function is one of the most useful functions for obtaining descriptive statistics; it shows us the distribution of the data points in more detail. We can see that the average gre score is 587.7 and the average gpa is 3.39. There is also at least one person with the maximum gre score and at least one with the maximum gpa. For those not familiar with the test format, 800 is the maximum gre score.
Since the summary function is not appropriate for analyzing categorical data such as the admission status (admit column) or rank, we are going to use two other functions to better understand our data: table and prop.table.
table(df$admit)
##
## 0 1
## 273 127
prop.table(table(df$admit))
##
## 0 1
## 0.6825 0.3175
This shows us both the absolute number of people admitted and rejected and their relative proportions. Let’s do the same for the rank variable.
table(df$rank)
##
## 1 2 3 4
## 61 151 121 67
prop.table(table(df$rank))
##
## 1 2 3 4
## 0.1525 0.3775 0.3025 0.1675
After those initial steps we can start to use R’s awesome plotting capabilities for further EDA. Two very useful plot types are histograms and boxplots, which let us visualize the distributions and check for outliers in the continuous variables.
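The plots themselves are not reproduced here, but a minimal sketch of how they could be built with ggplot2 and gridExtra (both loaded above) looks like this; the bin widths are illustrative choices rather than values from the original analysis.
# histograms of the two continuous variables
hist_gre <- ggplot(df, aes(x = gre)) + geom_histogram(binwidth = 20)
hist_gpa <- ggplot(df, aes(x = gpa)) + geom_histogram(binwidth = 0.1)
# boxplots of the same variables, useful for spotting outliers
box_gre <- ggplot(df, aes(x = "", y = gre)) + geom_boxplot() + xlab("")
box_gpa <- ggplot(df, aes(x = "", y = gpa)) + geom_boxplot() + xlab("")
# arrange all four plots in a 2x2 grid
grid.arrange(hist_gre, hist_gpa, box_gre, box_gpa, ncol = 2)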
By examining these plots we can see that most people have gre scores between 400 and 700 and a gpa between 3 and 3.7. What we can see here, and was not visible in the summary statistics, is the relatively large number of students at the maximum value of each variable. The boxplots confirm these observations.
Now let’s get started with drilling deeper for insights. One frequently used pre-processing step is Principal Component Analysis (PCA). This technique is especially useful for data sets with a large number of features (explanatory variables; gre, gpa and rank in our case), where we cannot easily decide which ones to keep. In such cases PCA constructs linear combinations of the features (the principal components) that preserve most of the variation (information) in the data set, and we can carry those forward into further algorithmic work. Since we only have three features here we probably do not need PCA, but let’s run it and see how it works.
data.pca <- prcomp(df[,2:4],
center = TRUE,
scale. = TRUE)
plot(data.pca, type = "b")
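The scree plot shows the variance captured by each component. To see the same information numerically we can also print the summary of the prcomp object, which reports the standard deviation, proportion of variance and cumulative proportion for each component:
# numeric view of how much variance each principal component explains
summary(data.pca)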
Unsurprisingly the first principal component explains the largest share of the variation, but the other two are not negligible, so in this case we do not need to drop any features from our model.
Now let’s dive into something really cool and see whether there are actually several distinct groups (clusters) of students in our data. Choosing the parameter k is often more art than science; one quick heuristic for checking it is sketched below, and here we will go with k = 3.
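A common heuristic, not part of the original analysis, is the elbow method: fit k-means for a range of k values and plot the total within-cluster sum of squares, looking for the point where the curve starts to flatten. A minimal sketch, using the raw (unscaled) gre and gpa columns just as the clustering below does:
# elbow plot: total within-cluster sum of squares for k = 1..8
set.seed(42)
wss <- sapply(1:8, function(k) kmeans(df[, 2:3], centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "number of clusters k", ylab = "total within-cluster SS")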
# cluster students on gre and gpa only
kmeans_df <- df[, 2:3]
set.seed(42)  # k-means starts from random centers, so fix the seed for reproducibility
km <- kmeans(kmeans_df, centers = 3)
# plot the points coloured by cluster and overlay the cluster centers
plot(kmeans_df[c("gre", "gpa")], col = km$cluster)
points(km$centers[, c("gre", "gpa")], col = 1:3, pch = 8, cex = 2)
By analyzing this plot we can see the three clusters of students that the algorithm found. Finally, we can use an SVM to try to predict whether a student gets admitted or not (we could use the k-means clusters for the same purpose, but that is left as an exercise). The following code is in Python.
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")

# min-max normalize the continuous features to the [0, 1] range
def normalize(series):
    return (series - series.min()) / (series.max() - series.min())

df.gre = normalize(df.gre)
df.gpa = normalize(df.gpa)

# split into training and test sets
features = df.loc[:, ['gre', 'gpa']]
labels = df['admit']
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.33, random_state=22)

# fit an SVM with the default RBF kernel and evaluate on the held-out set
clf = svm.SVC()
clf.fit(features_train, labels_train)
predictions = clf.predict(features_test)
accuracy_score(labels_test, predictions)
# output:
0.71212121212121215
This accuracy is not ideal, and we could improve it by tuning a few of the model’s hyperparameters (for example the C and gamma parameters of the SVM). Try it for yourself and see if it helps. Another important step of the data science process that was not covered in this project is cross-validation, a procedure for estimating how well the model generalizes and making sure our train/test split has not introduced any bias.
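As a rough sketch of what this could look like, here is 10-fold cross-validation of an RBF-kernel SVM back in R using the caret package we loaded earlier (this assumes the kernlab package is installed, since caret uses it as the SVM backend; the fold count and seed are arbitrary choices):
# 10-fold cross-validation of an RBF-kernel SVM on the admissions data
set.seed(22)
ctrl <- trainControl(method = "cv", number = 10)
svm_cv <- train(factor(admit) ~ gre + gpa, data = df,
                method = "svmRadial",
                preProcess = c("center", "scale"),
                trControl = ctrl)
svm_cv  # prints the cross-validated accuracy for each candidate cost value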
Planned updates:
logistic regression
using k-means as a classifier (not just for visualisation)