Introduction

The data given by x are clustered by the k-means method, which aims to partition the points into k groups such that the sum of squares from points to the assigned cluster centers is minimized. At the minimum, all cluster centers are at the mean of their Voronoi sets (the set of data points which are nearest to the cluster center). My goal is to perform k-means clustering on a data matrix.

---
title: "An Example Using the K-Means Technique"
author: "Chris Rucker"
output:
  tufte::tufte_handout: default
  tufte::tufte_html: default
---

There are two goals of this handout:

Explore input data containing subscriber offers, campaigns, and lines of business information from Mediacom Communications Corporation;
Group these subscribers into 2 clusters based on their attributes.

Exploratory Data Analysis

Data in R are often stored in data frames, because they can store multiple types of data. (In R, data frames are more general than matrices, because matrices can only store one type of data.) This example highlights some common functions in R that I like to use to explore a data frame before I conduct any statistical analysis.

KEY MEMBERS OF KOBIE MARKETING11 Kobie, namely David Eppler and Meredith Tayshetye provided the inspiration for this handout.

Box-and-Whisker

Images and graphics play an integral role in visualizing data. For example:

Boxplot of Mediacom offers, by minimum quantity.

seg_data <- read.csv('seg_data.csv')

boxplot(seg_data$min_qty, col="red")
title("min_qty")

The box-and-whisker plot is an exploratory graphic, created by John W. Tukey, used to show the distribution of a dataset (at a glance). Think of the type of data you might use a histogram with, and the box-and-whisker (or box plot, for short) could probably be useful.

The top 50% of the group are represented by everything above the median (the red line). Those in the top 25% are shown by the top “whisker” and dots. Dots represent those who are a lot less than normal (outliers).

Structure

Compactly display the internal structure of the seg_data object, a diagnostic function and an alternative to summary (and to some extent, dput).

str(seg_data)

## 'data.frame':    600 obs. of  6 variables:
##  $ offer_id: int  0 2185 3687 0 0 10 0 3723 0 0 ...
##  $ campaign: chr  "January" "March" "August" "April" ...
##  $ varietal: chr  "C" "I" "C" "C" ...
##  $ min_qty : int  2 12 6 32 3 52 4 18 9 1 ...
##  $ discount: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ origin  : int  1000 2000 3000 5000 5000 5000 5000 5000 5000 5000 ...

Summary Statistics

The pastecs package has methods that provide descriptive statistics on a data frame, and summary is a generic function used to produce result summaries of the results of various model fitting functions.

require(pastecs)

## Loading required package: pastecs

## Loading required package: boot

options(scipen=100)
options(digits=3)
stat.desc(seg_data)

##                offer_id campaign varietal  min_qty discount      origin
## nbr.val          600.00       NA       NA  600.000  600.000     600.000
## nbr.null         294.00       NA       NA    0.000  517.000       0.000
## nbr.na             0.00       NA       NA    0.000    0.000       0.000
## min                0.00       NA       NA    1.000    0.000    1000.000
## max             3723.00       NA       NA  230.000  220.000    8500.000
## range           3723.00       NA       NA  229.000  220.000    7500.000
## sum           539110.00       NA       NA 5462.000 2205.720 2795400.000
## median             2.00       NA       NA    3.000    0.000    5000.000
## mean             898.52       NA       NA    9.103    3.676    4659.000
## SE.mean           46.12       NA       NA    0.797    0.667      49.072
## CI.mean.0.95      90.58       NA       NA    1.565    1.311      96.375
## var          1276218.90       NA       NA  380.774  267.180 1444860.434
## std.dev         1129.70       NA       NA   19.513   16.346    1202.023
## coef.var           1.26       NA       NA    2.144    4.446       0.258

summary(seg_data)

##     offer_id      campaign           varietal            min_qty     
##  Min.   :   0   Length:600         Length:600         Min.   :  1.0  
##  1st Qu.:   0   Class :character   Class :character   1st Qu.:  2.0  
##  Median :   2   Mode  :character   Mode  :character   Median :  3.0  
##  Mean   : 899                                         Mean   :  9.1  
##  3rd Qu.:2034                                         3rd Qu.:  8.0  
##  Max.   :3723                                         Max.   :230.0  
##     discount         origin    
##  Min.   :  0.0   Min.   :1000  
##  1st Qu.:  0.0   1st Qu.:5000  
##  Median :  0.0   Median :5000  
##  Mean   :  3.7   Mean   :4659  
##  3rd Qu.:  0.0   3rd Qu.:5000  
##  Max.   :220.0   Max.   :8500

Header Rows

Returns the first or last parts of a vector, matrix, table, data frame or function. Since head() and tail() are generic functions, they may also have been extended to other classes.

head(seg_data)

##   offer_id campaign varietal min_qty discount origin
## 1        0  January        C       2        0   1000
## 2     2185    March        I      12        0   2000
## 3     3687   August        C       6        0   3000
## 4        0    April        C      32        0   5000
## 5        0    April        I       3        0   5000
## 6       10    March        T      52        0   5000

Modeling & Prediction

⊕The cluster is created here using randomly generated numbers for reproducible research. Offers 939 and 957 are active in Delaware.

Load Package

library(class)

Set Seed

set.seed(11111)
auto_subset <- seg_data[1:600, c(1,6)]
clusters <- kmeans(auto_subset, 2)
clusters

## K-means clustering with 2 clusters of sizes 588, 12
## 
## Cluster means:
##   offer_id origin
## 1      880   4594
## 2     1784   7867
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   2   1 
##  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  91  92  93  94  95  96  97  98  99 100 
##   1   1   1   1   1   1   1   1   1   1 
##  [ reached getOption("max.print") -- omitted 500 entries ]
## 
## Within cluster sum of squares by cluster:
## [1] 1476968519   17356878
##  (between_SS / total_SS =   8.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

⊕Within cluster sum of squares by cluster measures how closely related are objects in a cluster.

⊕A vector of integers (from 1:k) indicating the cluster to which each point is allocated.

Create Matrix

par(mfrow=c(1, 1))
plot(auto_subset$offer_id, auto_subset$origin, col=clusters$cluster, pch=20, cex=2)
points(clusters$centers, col="blue", pch=17, cex=3)

Tables

This table shows a count of minimum quantity versus the offers in the output. Several offers in Prin 5000, the State of Delaware, are active.

seg_data2 <- read.csv('seg_data2.csv')

knitr::kable(
  seg_data2[1:5, 1:2], caption="A subset of offers."
)

A subset of offers.

offer_id	total
939	10
957	10
1874	10
2102	12
2185	11

Conclusions

K-means clustering is a fast and easy way to group data based on similarities in data.

“Essentially, all models are wrong, but some are useful.”

— George E.P. Box

SeQueL

ALTER SESSION SET CURRENT_SCHEMA=OPS$MDC;

SELECT OFFER_ID_OCI "offer_id",
    (CASE
        WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '1%' THEN 'January'
        WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '2%' THEN 'February'
        WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '3%' THEN 'March'
        WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '4%' THEN 'April'
        WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '5%' THEN 'May'
        WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '6%' THEN 'June'
        WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '7%' THEN 'July'
        WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '8%' THEN 'August'
        WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '9%' THEN 'September'
        WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '10%' THEN 'October'
        WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '11%' THEN 'November'
        WHEN EXTRACT(MONTH FROM BILL_START_DTE_OCI) LIKE '12%' THEN 'December'
        ELSE 'ERROR'
    END) "campaign",
LOB_ACT_OCI "varietal", COUNT(OFFER_ID_OCI) "min_qty", SUM(DSC_AMT_ITV) "discount", PRIN_OCI "origin"
FROM OCI_CUR_ITEM, ITV_ITEM_VALUE
WHERE SUB_ACCT_NO_OCI = SUB_ACCT_NO_ITV
AND CUST_ACCT_NO_OCI = CUST_ACCT_NO_ITV 
AND ROWNUM < 5463
GROUP BY OFFER_ID_OCI, LOB_ACT_OCI, BILL_START_DTE_OCI, OFFER_ID_OCI, DSC_AMT_ITV, PRIN_OCI;

I would like to thank Wairy for allowing me to follow my dreams.

Kobie Handout

An Implementation in R Markdown

Chris Rucker

2017-01-06