The data used for this tutorial is downloaded from Wiley’s website. You have the data, but how do you read data in from your files?
READING DATA IN FROM FILES
The first thing you need to know is the Working Directory. The Working Directory is the folder in which you can put data so that the R console can find it and read it in. The getwd() displays the current working directory:
getwd()
# My working directory in Rstudio is:
#>> [1] "C:/Users/pkolowe/RHOME"
The above working directory is what I get when I run the getwd() command in the R console. If you don’t like your working directory, you can change it with the setwd() command. keep in mind that even on windows machines R expects directory paths to be specified with forward slashes. For instance,
setwd(“/Users/Ashleyrich/datafiles”)
Use this command to set your working directory to a place where you’re happy to put in some data. You can start by putting the downloaded data WineKMC.csv file in that directory. Read the data in and take a look. Because the data is csv file, read it using the read.csv() function.
# Read the data using the read.csv() function
winedata <- read.csv("C:/Users/pkolo/OneDrive/Documents/WineKMC.csv")
READ THE FIRST OBSERVATION OF THE DATA winedata
We are going to print the first few columns(here 9 columns to fit on this page) to see descriptive data about each of the 32 offers followed by some customers.
# Read the first 9 columns out of 107 using the code winedata[, 1:9]
print(winedata[, 1:9])
From the output, we can see the names of the first 9 columns in our data. We could have used the following code to print the first 9 columns, but it would be unecessarily long. Below is a long code:
print(winedata, subset = c(Offer, Mth, Varietal , MinQty, Disc, Origin, PastPeak, Adams, Allen))
Following is another way of reading the first 9 columns.
print(winedata[, c("Offer", "Mth", "Varietal" , "MinQty", "Disc", "Origin", "PastPeak", "Adams", "Allen")])
The selected columns are all in! But you’ll notice that the blank spaces in purchase vectors (which Excel treats as zeroes) have beome NA values.
REPLACE MISSING VALUES WITH ZEROES
You will need to replace the NA values with 0, which can be done using the is.na() function inside of brackets:
# Using the is.na() function inside brackets to replace NA values
winedata[is.na(winedata)] <- 0
After replacing the NA values with 0, check the data winedata to make sure the changes have happened. Let’s check the first 10 rows and columns 8 to 17 with missing values.
# Select rows 1:10 and columns 8:17 of the winedata dataset
winedata[1:10, 8:17]
DOING SOME ACTUAL DATA SCIENCE
A this point we have learned how to read data into R from a CSV file. Since we have already loaded the wine data, we will start with a little K-means clustering.
Spherical K-Means on Wine Data in Just a Few Lines
In this section, we’ll cluster based on cosine similarity, also called Spherical K-means. In R, there is a spherical k-means package we can load, called skmeans. But skmeans is written by a third party as a package we can download into R and use. Like most packages in R, you can read it and install it from the Comprehensive R Archive Network (CRAN). CRAN is a repository of many of th useful packages that can be loaded into R to extent its functionality. A list of packages you can download from CRAN available here: http://cran.r-project.org/web/packages.
So to get the skmeans package installed in R, you need only use the install.packages() function. I have already installed the package. I will then load it using library(skmeans). If you want to learn more on the use of skmeans() function, just type ?skmeans in the console and the help result will pop up.
The documentation specifies that skmeans() accepts a matrix where each row corresponds to an object to cluster.
# Install skmeans package
library(skmeans)
Our data on the other hand is column-oriented with a bunch of deal descriptors at the beginning that the algorithm isn’t going to see. We will need to transpose the data. The transpose function coerces a matrix out of the dataframe.
If you use the ncol() function, you can see that the customer columns go out to column 107, so you can isolate just the purchase vectors as rows for each customer by transposing the data from column 8 to 107 and saving it in a new variable called winedata.transposed. This process is known as DATA TRANSFORMATION.
# Check all the data columns in the winedata using ls() and names()
names(winedata)
[1] "Offer" "Mth" "Varietal" "MinQty" "Disc" "Origin" "PastPeak" "Adams"
[9] "Allen" "Anders" "Bailey" "Baker" "Barnes" "Bell" "Bennett" "Brooks"
[17] "Brown" "Butler" "Campbell" "Carter" "Clark" "Collins" "Cook" "Cooper"
[25] "Cox" "Cruz" "Davis" "Diaz" "Edwards" "Evans" "Fisher" "Flores"
[33] "Foster" "Garcia" "Gomez" "Gonzalez" "Gray" "Green" "Gutierrez" "Hall"
[41] "Harris" "Hernandez" "Hill" "Howard" "Hughes" "Jackson" "James" "Jenkins"
[49] "Johnson" "Jones" "Kelly" "King" "Lee" "Lewis" "Long" "Lopez"
[57] "Martin" "Martinez" "Miller" "Mitchell" "Moore" "Morales" "Morgan" "Morris"
[65] "Murphy" "Myers" "Nelson" "Nguyen" "Ortiz" "Parker" "Perez" "Perry"
[73] "Peterson" "Phillips" "Powell" "Price" "Ramirez" "Reed" "Reyes" "Richardson"
[81] "Rivera" "Roberts" "Robinson" "Rodriguez" "Rogers" "Ross" "Russell" "Sanchez"
[89] "Sanders" "Scott" "Smith" "Stewart" "Sullivan" "Taylor" "Thomas" "Thompson"
[97] "Torres" "Turner" "Walker" "Ward" "Watson" "White" "Williams" "Wilson"
[105] "Wood" "Wright" "Young"
To check the numbers of columns, rows, and length of the data.
# Check numbers of columns, rows, and length of the data
ncol(winedata)
[1] 107
nrow(winedata)
[1] 32
length(winedata)
[1] 107
Transposing our data using the t() function
# Create a new object called winedata.transposed
winedata.transposed <- t(winedata[, 8:107])
Check the content of the transposed data.
winedata.transposed[1:10, 1:10]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
Adams 0 0 0 0 0 0 0 0 0 0
Allen 0 0 0 0 0 0 0 0 1 0
Anders 0 0 0 0 0 0 0 0 0 0
Bailey 0 0 0 0 0 0 1 0 0 0
Baker 0 0 0 0 0 0 1 0 0 1
Barnes 0 0 0 0 0 0 0 0 0 1
Bell 0 1 0 0 0 0 0 0 0 0
Bennett 0 0 0 0 0 0 0 1 0 0
Brooks 0 0 1 0 0 0 0 1 0 0
Brown 0 0 0 0 0 0 1 0 0 0
We now have columns 8 to 107 in the rows of the dataset. You can then call skmeans on the dataset, specifying five means and the use of genetic algorithm. We can assign the result back to an object called winedata.clusters:
# Create an object called winedata.clusters
winedata.Clusters <- skmeans(winedata.transposed, 5, method = "genetic")
winedata.Clusters
A hard spherical k-means partition of 100 objects into 5 classes.
Class sizes: 21, 25, 21, 16, 17
Call: skmeans(x = winedata.transposed, k = 5, method = "genetic")
Calling str() on the clusters object shows that the actual cluster assignments are stored within the “cluster” list of the objects:
# Displaying the structure of winedata.Clusters
str(winedata.Clusters)
List of 7
$ prototypes: num [1:5, 1:32] 0 0.097 0.2678 0.0337 0 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:5] "1" "2" "3" "4" ...
.. ..$ : NULL
$ membership: NULL
$ cluster : Named int [1:100] 1 2 4 1 3 3 4 5 5 1 ...
..- attr(*, "names")= chr [1:100] "Adams" "Allen" "Anders" "Bailey" ...
$ family :List of 7
..$ description: chr "spherical k-means"
..$ D :function (x, prototypes)
..$ C :function (x, weights, control)
..$ init :function (x, k)
..$ e : num 1
..$ .modify : NULL
..$ .subset : NULL
..- attr(*, "class")= chr "pclust_family"
$ m : num 1
$ value : num 38.1
$ call : language skmeans(x = winedata.transposed, k = 5, method = "genetic")
- attr(*, "class")= chr [1:2] "skmeans" "pclust"
Suppose that we want to pull back the cluster assignments for row 4, we would have to use the matrix notation on the cluster vector.
# Pulling the cluster assignment
winedata.Clusters$cluster[4]
Bailey
1
Now, each row is labelled with a customer’s name since they were labelled when we read the data with read.csv() function. Furthermore, we can also pull assignment using the row.names() function combined with the which() function.
# Pull label for the customer named Wright
winedata.Clusters$cluster[which(row.names(winedata.transposed)=="Wright")]
Wright
2
# Brooks
winedata.Clusters$cluster[which(row.names(winedata.transposed)=="Brooks")]
Brooks
5
PERFORM THE COUNTS USING THE AGGREGATE() FUNCTION
To do the counts, we just use the aggregate() function where in the “by” field we specify the cluster assignments - meaning “aggregate purchases by assignment”. In addition, it is better to specify the kind of assignment we want, such as min, max, mean, sum, median, and so on:
aggregate(winedata.transposed, by = list(winedata.Clusters$cluster), sum)
Let’s do more work on this output. We will flip the first row of the aggregation, which only gives us the cluster assignment names. Then, store all this back as a variable called winedata.Clusterscounts:
# Store cluster assignment as a new varibale
winedata.Clusterscounts <- t(aggregate(winedata.transposed, by = list(winedata.Clusters$cluster), sum)[,2:33])
# Display the output of winedata.Clusterscounts
winedata.Clusterscounts
[,1] [,2] [,3] [,4] [,5]
V1 0 3 6 1 0
V2 0 0 4 6 0
V3 1 2 0 0 3
V4 0 9 2 0 1
V5 0 4 0 0 0
V6 0 10 0 0 2
V7 14 1 1 0 3
V8 3 0 0 0 17
V9 0 10 0 0 0
V10 1 1 3 1 1
V11 0 3 9 0 1
V12 1 0 3 1 0
V13 4 0 0 0 2
V14 0 7 2 0 0
V15 0 5 1 0 0
V16 0 3 1 1 0
V17 0 0 0 7 0
V18 9 0 1 0 4
V19 0 1 2 0 2
V20 0 4 2 0 0
V21 1 1 1 0 1
V22 0 8 10 0 3
V23 0 4 0 1 0
V24 0 0 0 12 0
V25 0 2 4 0 0
V26 0 3 0 12 0
V27 0 7 0 1 1
V28 1 0 5 0 0
V29 12 0 1 0 4
V30 11 1 5 0 5
V31 0 5 10 0 2
V32 0 3 1 0 0
These are our counts of deals by cluster. Let’s put those seven columns of descriptive data back on to the deals using the column bind cbind():
# Here are the seven columns from our original data
names(winedata[,1:7])
[1] "Offer" "Mth" "Varietal" "MinQty" "Disc" "Origin" "PastPeak"
# Combining the seven columns using cbind()
winedata.desc.plus.counts <- cbind(winedata[,1:7], winedata.Clusterscounts)
# Display the contents of winedata.desc.plus.counts
winedata.desc.plus.counts
We can sort the output using the order() function inside the brackets of the dataframe. Here is a sort to discover the most popular deals for cluster 1. We will put a minus sign before the data to sort it in descending order. You can, alternatively, use decreasing = TRUE flag in the order() function:
# Order our data in descending order
winedata.desc.plus.counts[order(-winedata.desc.plus.counts[,8]),]
Assessing the top deals, it becomes clear cluster 1 is the PROSECCO cluster from Australia. Note that results may vary because the genetic algorithm doesn’t give the same answer each time. You can also check the top deals in cluster 2 in the following code.
# Top deals in cluster 2
winedata.desc.plus.counts[order(-winedata.desc.plus.counts[,9]),]
Again in the second cluster, the top deal is PROSECCO cluster from Chile. Let’s check the third and last cluster of this tutorial.
# Third cluster
winedata.desc.plus.counts[order(-winedata.desc.plus.counts[,10]),]
CHAMPAGNE from France prevails in the third cluster as the most popular items.