Image credit: Aditya Vyas.
This analysis performs a k-means clustering of more than 9.2 million NYSE observations across 6 columns: Opening Price, High Price, Low Price, Closing Price, Trading Volume, and Closing Adjusted Price.
K-means clustering is an unsupervised learning method used to uncover non-random structure in your data. It does so by creating labels for your training data, where each label is a cluster. An ideal cluster analysis groups points together on a scatterplot, with maximized space between groups (separation) and minimized space between points in the same group (cohesion). Clusters can be centered on actual data points (k-medoids) or on computed points that need not exist in the data (k-means). This analysis uses the latter.
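To make cohesion and separation concrete, here is a minimal sketch on R's built-in iris data (an illustrative stand-in, not the NYSE data used below); kmeans() reports cohesion as the total within-cluster sum of squares and separation as the between-cluster sum of squares.
#Toy example on iris (illustration only, not the NYSE data)
toy <- scale(iris[, c("Petal.Length", "Petal.Width")])
toy_fit <- kmeans(toy, centers = 3)
toy_fit$tot.withinss #cohesion: smaller means tighter clusters
toy_fit$betweenss    #separation: larger means better-separated clusters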
This is just one of the five machine learning modeling guides you can find here.
Perform k-means clustering of the New York Stock Exchange (NYSE) dataset, which contains 9,211,031 NYSE trade records. Columns 1 through 7 are: ID, Opening Price, High Price, Low Price, Closing Price, Trading Volume, and Closing Adjusted Price.
Use columns 2 to 7 from the input data and perform the k-means clustering with k = 4. Set the maximum number of iterations to 10,000.
Check the data before you perform the clustering task. Output the final four centers generated by the clustering process.
Given how large this dataset is, it will be interesting to measure how long the processor takes to perform this calculation. This computer is running an Intel Core i7-6500U with 16GB of DDR4-2400 RAM. You can use CPU-Z to determine your RAM speed.
Start the stopwatch!
start_time <- Sys.time()
There are two ways to load the required packages; both are shown commented out below because this analysis needs only base R.
# #install.packages("pacman")
# library("pacman")
#Check for and install required packages, using pacman
#if (!require("pacman")) install.packages("pacman")
#pacman::p_load() #No packages required :)
We load the raw dataset from a local CSV file:
tradingData <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/7 kmeans/NYSE_DM.csv", header = TRUE)
Examine the data structure.
#Preview structure
str(tradingData)
## 'data.frame': 9211030 obs. of 7 variables:
## $ X157801: int 279752 346856 347167 347169 347170 347175 347216 512385 512386 623895 ...
## $ X25 : num 25 25 25 25 25 ...
## $ X25.1 : num 25 25 25 25 25 ...
## $ X25.2 : num 25 25 25 25 25 ...
## $ X25.3 : num 25 25 25 25 25 ...
## $ X0 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.32 : num 6.34 4.96 4.62 4.62 4.62 ...
Examine the top 5 rows.
#Preview top 5 rows
head(tradingData, n=5)
## X157801 X25 X25.1 X25.2 X25.3 X0 X1.32
## 1 279752 25 25 25 25 0 6.34
## 2 346856 25 25 25 25 0 4.96
## 3 347167 25 25 25 25 0 4.62
## 4 347169 25 25 25 25 0 4.62
## 5 347170 25 25 25 25 0 4.62
The dataset is raw and requires some tidying up. We’ll give the dataset header names and drop the ID column, which carries no useful information.
#give col names
colnames(tradingData) <- c("ID","OPEN_P", "HIGH_P", "LOW_P", "CLOSE_P", "VOLUME", "CLOSE_ADJ_P")
#drop ID column
tradingData <- tradingData[, -c(1)]
Next, we need to standardize the dataset so that the clusters are generated appropriately. Without standardization, columns with large absolute values and variances (such as Volume) dominate the Euclidean distance, so they drive the clustering while columns on smaller scales barely contribute. Consider, for instance, a scatterplot with Age on one axis and Salary on the other: salary differences in the thousands would swamp age differences of a few years.
#standardize the continuous numbers
tradingData <- scale(tradingData)
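For clarity, scale() applies a per-column z-score transformation, so every column ends up with mean 0 and standard deviation 1 and contributes comparably to the Euclidean distance. A quick self-contained check on a toy vector (illustration only, not the NYSE data):
#scale() is equivalent to (x - mean(x)) / sd(x), applied column by column
x <- c(10, 20, 30, 40) #toy vector for illustration
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x)) #returns TRUE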
We’ll now use the k-means algorithm to fit a model. It will divide the datapoints into 4 clusters, allocating each point to a cluster using the Euclidean distance. We allow up to 10,000 iterations; each iteration recalculates where the cluster centers lie and reassigns each point to its nearest center.
#perform kmeans with k = 4
fit <- kmeans(tradingData, centers = 4, iter.max = 10000)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 460551500)
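This warning comes from the default Hartigan-Wong algorithm and means its quick-transfer stage hit its step limit before fully settling, which is common on very large datasets with many duplicate rows. One possible workaround, not used in the timed run above, is to switch to another variant that kmeans() supports, such as MacQueen:
#optional: avoid the Quick-TRANSfer warning by using a different algorithm
#fit <- kmeans(tradingData, centers = 4, iter.max = 10000, algorithm = "MacQueen")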
We can inspect the four cluster centers, expressed in standardized units.
#display the results, taken from the fit table centers
fit$centers
## OPEN_P HIGH_P LOW_P CLOSE_P VOLUME CLOSE_ADJ_P
## 1 11.9464343 12.4043402 12.3989843 12.3883239 -0.15289614 11.1098074
## 2 -0.3102046 -0.3158623 -0.3156256 -0.3159013 -0.07726181 -0.1740092
## 3 1.1070029 1.1243669 1.1234951 1.1247201 0.28433773 0.5691728
## 4 -0.3643898 -0.3528159 -0.3903666 -0.3710729 62.12446379 -0.1150070
View(fit$centers)
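The centers above are in standardized units. As an optional extra (not part of the original output), they can be mapped back to the original price and volume units using the column means and standard deviations that scale() stores as attributes, and fit$size reports how many observations landed in each cluster.
#convert standardized centers back to the original units (optional extra)
orig_centers <- sweep(fit$centers, 2, attr(tradingData, "scaled:scale"), "*")
orig_centers <- sweep(orig_centers, 2, attr(tradingData, "scaled:center"), "+")
orig_centers
#number of observations in each of the four clusters
fit$size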
Now stop the stopwatch.
#stop watch
end_time <- Sys.time()
total_time <- end_time - start_time
total_time
## Time difference of 1.257495 mins
The total time required to process all 9,211,031 rows into four clusters was 1.2574951 minutes.
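As an aside, base R's system.time() captures the same elapsed-time measurement in a single call; a minimal sketch is shown commented out, since rerunning it would repeat the clustering.
#alternative timing: wrap the expensive call directly
#system.time(kmeans(tradingData, centers = 4, iter.max = 10000))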
K-means clustering was applied to 9.2 million NYSE observations of opening, high, low, closing, and closing adjusted prices, as well as trading volume. This laptop required 1.2574951 minutes to assign each observation to one of four clusters, with up to 10,000 iterations allowed. This analysis shows that surprisingly little time is required to analyze big, tall datasets.
Thank you for reading, and happy clustering!
To view this entire document’s markdown code, click here.
I’ve recorded a 45-minute video on how to take machine learning to the next level in an applied Wine Quality Prediction Project.
If you’re not ready for that and want a tutorial on the basics of machine learning, my 1.5-hour Overview of Machine Learning might be a better fit. It will guide you through many of the general concepts, as well as some of the various models listed below.