The data given by x are clustered by the k-means method, which aims to partition the points into k groups such that the sum of squares from points to the assigned cluster centers is minimized. At the minimum, all cluster centers are at the mean of their Voronoi sets (the set of data points which are nearest to the cluster center). My goal is to perform k-means clustering on a data matrix.
---
title: "An Example Using the K-Means Technique"
author: "Chris Rucker"
output:
tufte::tufte_handout: default
tufte::tufte_html: default
---
There are two goals of this handout:
Data in R are often stored in data frames, because they can store multiple types of data. (In R, data frames are more general than matrices, because matrices can only store one type of data.) This example highlights some common functions in R that I like to use to explore a data frame before I conduct any statistical analysis.
KEY MEMBERS OF KOBIE MARKETING1 Kobie, namely David Eppler and Meredith Tayshetye provided the inspiration for this handout.
Images and graphics play an integral role in visualizing data. For example:
Boxplot of Mediacom offers, by minimum quantity.
seg_data <- read.csv('seg_data.csv')
boxplot(seg_data$min_qty, col="red")
title("min_qty")
The box-and-whisker plot is an exploratory graphic, created by John W. Tukey, used to show the distribution of a dataset (at a glance). Think of the type of data you might use a histogram with, and the box-and-whisker (or box plot, for short) could probably be useful.
The top 50% of the group are represented by everything above the median (the red line). Those in the top 25% are shown by the top “whisker” and dots. Dots represent those who are a lot less than normal (outliers).
Compactly display the internal structure of the seg_data object, a diagnostic function and an alternative to summary (and to some extent, dput).
str(seg_data)
## 'data.frame': 600 obs. of 6 variables:
## $ offer_id: int 0 2185 3687 0 0 10 0 3723 0 0 ...
## $ campaign: chr "January" "March" "August" "April" ...
## $ varietal: chr "C" "I" "C" "C" ...
## $ min_qty : int 2 12 6 32 3 52 4 18 9 1 ...
## $ discount: num 0 0 0 0 0 0 0 0 0 0 ...
## $ origin : int 1000 2000 3000 5000 5000 5000 5000 5000 5000 5000 ...
The pastecs package has methods that provide descriptive statistics on a data frame, and summary is a generic function used to produce result summaries of the results of various model fitting functions.
require(pastecs)
## Loading required package: pastecs
## Loading required package: boot
options(scipen=100)
options(digits=3)
stat.desc(seg_data)
## offer_id campaign varietal min_qty discount origin
## nbr.val 600.00 NA NA 600.000 600.000 600.000
## nbr.null 294.00 NA NA 0.000 517.000 0.000
## nbr.na 0.00 NA NA 0.000 0.000 0.000
## min 0.00 NA NA 1.000 0.000 1000.000
## max 3723.00 NA NA 230.000 220.000 8500.000
## range 3723.00 NA NA 229.000 220.000 7500.000
## sum 539110.00 NA NA 5462.000 2205.720 2795400.000
## median 2.00 NA NA 3.000 0.000 5000.000
## mean 898.52 NA NA 9.103 3.676 4659.000
## SE.mean 46.12 NA NA 0.797 0.667 49.072
## CI.mean.0.95 90.58 NA NA 1.565 1.311 96.375
## var 1276218.90 NA NA 380.774 267.180 1444860.434
## std.dev 1129.70 NA NA 19.513 16.346 1202.023
## coef.var 1.26 NA NA 2.144 4.446 0.258
summary(seg_data)
## offer_id campaign varietal min_qty
## Min. : 0 Length:600 Length:600 Min. : 1.0
## 1st Qu.: 0 Class :character Class :character 1st Qu.: 2.0
## Median : 2 Mode :character Mode :character Median : 3.0
## Mean : 899 Mean : 9.1
## 3rd Qu.:2034 3rd Qu.: 8.0
## Max. :3723 Max. :230.0
## discount origin
## Min. : 0.0 Min. :1000
## 1st Qu.: 0.0 1st Qu.:5000
## Median : 0.0 Median :5000
## Mean : 3.7 Mean :4659
## 3rd Qu.: 0.0 3rd Qu.:5000
## Max. :220.0 Max. :8500
Returns the first or last parts of a vector, matrix, table, data frame or function. Since head() and tail() are generic functions, they may also have been extended to other classes.
head(seg_data)
## offer_id campaign varietal min_qty discount origin
## 1 0 January C 2 0 1000
## 2 2185 March I 12 0 2000
## 3 3687 August C 6 0 3000
## 4 0 April C 32 0 5000
## 5 0 April I 3 0 5000
## 6 10 March T 52 0 5000
The cluster is created here using randomly generated numbers for reproducible research. Offers 939 and 957 are active in Delaware.
library(class)
set.seed(11111)
auto_subset <- seg_data[1:600, c(1,6)]
clusters <- kmeans(auto_subset, 2)
clusters
## K-means clustering with 2 clusters of sizes 588, 12
##
## Cluster means:
## offer_id origin
## 1 880 4594
## 2 1784 7867
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1
## 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 91 92 93 94 95 96 97 98 99 100
## 1 1 1 1 1 1 1 1 1 1
## [ reached getOption("max.print") -- omitted 500 entries ]
##
## Within cluster sum of squares by cluster:
## [1] 1476968519 17356878
## (between_SS / total_SS = 8.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
Within cluster sum of squares by cluster measures how closely related are objects in a cluster.
A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
par(mfrow=c(1, 1))
plot(auto_subset$offer_id, auto_subset$origin, col=clusters$cluster, pch=20, cex=2)
points(clusters$centers, col="blue", pch=17, cex=3)
This table shows a count of minimum quantity versus the offers in the output. Several offers in Prin 5000, the State of Delaware, are active.
seg_data2 <- read.csv('seg_data2.csv')
knitr::kable(
seg_data2[1:5, 1:2], caption="A subset of offers."
)
A subset of offers.
| offer_id | total |
|---|---|
| 939 | 10 |
| 957 | 10 |
| 1874 | 10 |
| 2102 | 12 |
| 2185 | 11 |
K-means clustering is a fast and easy way to group data based on similarities in data.
“Essentially, all models are wrong, but some are useful.”
ALTER SESSION SET CURRENT_SCHEMA=OPS$MDC;
SELECT OFFER_ID_OCI "offer_id",
(CASE
WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '1%' THEN 'January'
WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '2%' THEN 'February'
WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '3%' THEN 'March'
WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '4%' THEN 'April'
WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '5%' THEN 'May'
WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '6%' THEN 'June'
WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '7%' THEN 'July'
WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '8%' THEN 'August'
WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '9%' THEN 'September'
WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '10%' THEN 'October'
WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '11%' THEN 'November'
WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '12%' THEN 'December'
ELSE 'ERROR'
END) "campaign",
LOB_ACT_OCI "varietal", COUNT(OFFER_ID_OCI) "min_qty", SUM(DSC_AMT_ITV) "discount", PRIN_OCI "origin"
FROM OCI_CUR_ITEM, ITV_ITEM_VALUE
WHERE SUB_ACCT_NO_OCI = SUB_ACCT_NO_ITV
AND CUST_ACCT_NO_OCI = CUST_ACCT_NO_ITV
AND ROWNUM < 5463
GROUP BY OFFER_ID_OCI, LOB_ACT_OCI, BILL_START_DTE_OCI, OFFER_ID_OCI, DSC_AMT_ITV, PRIN_OCI;
I would like to thank Wairy for allowing me to follow my dreams.