Image credit: Aditya Vyas.
This analysis performs a k-means clustering of more than 9.2 million NYSE observations across 6 columns: Opening Price, High Price, Low Price, Closing Price, Trading Volume, and Closing Adjusted Price.
K-means clustering is an unsupervised learning method used to uncover non-random structure in your data. It does so by creating labels for your training data, where each label is a cluster. An ideal cluster analysis groups points together on a scatterplot, with maximized space between groups (separation) and minimized space between points in the same group (cohesion). Clusters can be centered on actual data points (k-medoids) or on computed points that need not exist in the data (k-means). This analysis uses the latter.
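To make cohesion and separation concrete, here is a minimal sketch on R's built-in iris data (an illustrative stand-in, not the NYSE data used below); kmeans() reports cohesion as the total within-cluster sum of squares and separation as the between-cluster sum of squares.
#Toy example on iris (illustration only, not the NYSE data)
toy <- scale(iris[, c("Petal.Length", "Petal.Width")])
toy_fit <- kmeans(toy, centers = 3)
toy_fit$tot.withinss #cohesion: smaller means tighter clusters
toy_fit$betweenss    #separation: larger means better-separated clusters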
This is just one of the five machine learning modeling guides you can find here.
Perform k-means clustering of the New York Stock Exchange (NYSE) dataset, which contains 9,211,031 NYSE trade records. Columns 1 through 7 are: ID, Opening Price, High Price, Low Price, Closing Price, Trading Volume, and Closing Adjusted Price.
Use columns 2 to 7 from the input data and perform the k-means clustering with k = 4. Set the maximum number of iterations to 10,000.
Check the data before you perform the clustering task. Output the final four centers generated by the clustering process.
Given how large this dataset is, it will be interesting to measure how long the processor takes to perform this calculation. This computer is running an Intel Core i7-6500U with 16GB of DDR4-2400 RAM. You can use CPU-Z to determine your RAM speed.
Start the stopwatch!
start_time <- Sys.time()
There are two ways to load the required packages; both are shown commented out below because this analysis needs only base R.
# #install.packages("pacman")
# library("pacman")
#Check for and install required packages, using pacman
#if (!require("pacman")) install.packages("pacman")
#pacman::p_load() #No packages required :)
We load the raw dataset from a local CSV file:
tradingData <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/7 kmeans/NYSE_DM.csv", header = TRUE)
Examine the data structure.
#Preview structure
str(tradingData)
## 'data.frame': 9211030 obs. of 7 variables:
## $ X157801: int 279752 346856 347167 347169 347170 347175 347216 512385 512386 623895 ...
## $ X25 : num 25 25 25 25 25 ...
## $ X25.1 : num 25 25 25 25 25 ...
## $ X25.2 : num 25 25 25 25 25 ...
## $ X25.3 : num 25 25 25 25 25 ...
## $ X0 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1.32 : num 6.34 4.96 4.62 4.62 4.62 ...
Examine the top 5 rows.
#Preview top 5 rows
head(tradingData, n=5)
## X157801 X25 X25.1 X25.2 X25.3 X0 X1.32
## 1 279752 25 25 25 25 0 6.34
## 2 346856 25 25 25 25 0 4.96
## 3 347167 25 25 25 25 0 4.62
## 4 347169 25 25 25 25 0 4.62
## 5 347170 25 25 25 25 0 4.62
The dataset is raw and requires some tidying up. We’ll give the dataset header names and drop the ID column, which carries no useful information.
#give col names
colnames(tradingData) <- c("ID","OPEN_P", "HIGH_P", "LOW_P", "CLOSE_P", "VOLUME", "CLOSE_ADJ_P")
#drop ID column
tradingData <- tradingData[, -c(1)]
Next, we need to standardize the dataset so that the clusters are generated appropriately. Without standardization, columns with large absolute values and variances (such as Volume) dominate the Euclidean distance, so they drive the clustering while columns on smaller scales barely contribute. Consider, for instance, a scatterplot with Age on one axis and Salary on the other: salary differences in the thousands would swamp age differences of a few years.
#standardize the continuous numbers
tradingData <- scale(tradingData)
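For clarity, scale() applies a per-column z-score transformation, so every column ends up with mean 0 and standard deviation 1 and contributes comparably to the Euclidean distance. A quick self-contained check on a toy vector (illustration only, not the NYSE data):
#scale() is equivalent to (x - mean(x)) / sd(x), applied column by column
x <- c(10, 20, 30, 40) #toy vector for illustration
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x)) #returns TRUE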
We’ll now use the k-means algorithm to fit a model. It will divide the datapoints into 4 clusters, allocating each point to a cluster using the Euclidean distance. We allow up to 10,000 iterations; each iteration recalculates where the cluster centers lie and reassigns each point to its nearest center.
#perform kmeans with k = 4
fit <- kmeans(tradingData, centers = 4, iter.max = 10000)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 460551500)
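This warning comes from the default Hartigan-Wong algorithm and means its quick-transfer stage hit its step limit before fully settling, which is common on very large datasets with many duplicate rows. One possible workaround, not used in the timed run above, is to switch to another variant that kmeans() supports, such as MacQueen:
#optional: avoid the Quick-TRANSfer warning by using a different algorithm
#fit <- kmeans(tradingData, centers = 4, iter.max = 10000, algorithm = "MacQueen")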
We can inspect the four cluster centers, expressed in standardized units.
#display the results, taken from the fit table centers
fit$centers
## OPEN_P HIGH_P LOW_P CLOSE_P VOLUME CLOSE_ADJ_P
## 1 11.9464343 12.4043402 12.3989843 12.3883239 -0.15289614 11.1098074
## 2 -0.3102046 -0.3158623 -0.3156256 -0.3159013 -0.07726181 -0.1740092
## 3 1.1070029 1.1243669 1.1234951 1.1247201 0.28433773 0.5691728
## 4 -0.3643898 -0.3528159 -0.3903666 -0.3710729 62.12446379 -0.1150070
View(fit$centers)
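The centers above are in standardized units. As an optional extra (not part of the original output), they can be mapped back to the original price and volume units using the column means and standard deviations that scale() stores as attributes, and fit$size reports how many observations landed in each cluster.
#convert standardized centers back to the original units (optional extra)
orig_centers <- sweep(fit$centers, 2, attr(tradingData, "scaled:scale"), "*")
orig_centers <- sweep(orig_centers, 2, attr(tradingData, "scaled:center"), "+")
orig_centers
#number of observations in each of the four clusters
fit$size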
Now stop the stopwatch.
#stop watch
end_time <- Sys.time()
total_time <- end_time - start_time
total_time
## Time difference of 1.257495 mins
The total time required to process all 9,211,031 rows into four clusters was 1.2574951 minutes.
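As an aside, base R's system.time() captures the same elapsed-time measurement in a single call; a minimal sketch is shown commented out, since rerunning it would repeat the clustering.
#alternative timing: wrap the expensive call directly
#system.time(kmeans(tradingData, centers = 4, iter.max = 10000))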
K-means clustering was applied to 9.2 million NYSE observations of opening, high, low, closing, and closing adjusted prices, as well as trading volume. This laptop required 1.2574951 minutes to assign each observation to one of four clusters, with up to 10,000 iterations allowed. This analysis shows that surprisingly little time is required to analyze big, tall datasets.
Thank you for reading, and happy clustering!
To view this entire document’s markdown code, click here.
I’ve recorded a 45-minute video on how to take machine learning to the next level in an applied Wine Quality Prediction Project.
If you’re not ready for that and want a tutorial on the basics of machine learning, my 1.5-hour Overview of Machine Learning might be a better fit. It will guide you through many of the general concepts, as well as some of the various models listed below.