eCOTS 2022 e-Conference on Teaching Data Mining

class: center, middle, inverse, title-slide

.title[
# eCOTS 2022 e-Conference on Teaching Data Mining 
]
.author[
### Hosted by St Cloud State University 
]
.date[
### 05/14/2022 
]

---

class: inverse
## Agenda (15 minutes)

1. Hello from hosts and goals of the conference (1 minute)

2. History of our data mining course (3 minutes, by Zhang)

3. Polling the audience (2 minutes, by Zhang)

4. Introducing our speakers (5 minutes, by Li)

5. Brief introduction to data mining (4 minutes, by Zhang)

6. First talk starts

---
## Goals of the Conference

1. To share ideas on teaching data mining to prepare the modern student in general

2. To share our practice on teaching data mining methods in particular

3. To hear what applied data scientists would say about data mining methods

4. To throw bricks in order to attract jade

## History of Our Data Mining Course 
(3 minutes, by Shiju Zhang)

- Initially created by the *Information Systems* department. It was titled *Decision Support System* and first launched in 2004.

- The creator (Dr. Rich Sundheim) moved to the Statistics unit in 2012. He taught the same course under the name *Data Mining*.

- The enrollment was usually around 25, mostly statistics majors.

- Taught with SAS Enterprise Miner for many years, with JMP Pro for a few years, and recently with R.

- Topics we covered include the following:

<pre> 
 Data Partition, Data Cleaning, Data Normalization, 
 Data Visualization, Principal Component Analysis, 
 Logistic Regression, Lift Chart, 
 Classification & Regression Trees, 
 Ensembles (Random Forest, Boosted Tree), 
 Text Mining, Neural Network, K-Nearest Neighbors, 
 Clustering Methods, Uplift Modeling, 
 Naive Bayes Classifier, Support Vector Machine
</pre>

---
## Polling the Audience 
(2 minutes)

Please click the "Polling" icon at the bottom of your Zoom screen.

Then wait for the host to launch the poll. (by Zhang)

---
## Introducing Our Speakers 
(5 minutes, by Xiaoyin Li)

- Dr. Ibrahim Soumare
- Dr. Xiaoyin Li
- Mr. Simeon Paynter
- Dr. Mengshi Zhou
- Dr. Xinliang Zhu
- Dr. Cheng Peng
- Dr. Dan Liu
- Dr. Shiju Zhang

---
## What Is Data Mining

- Data mining is the process of turning data into information.

- Data mining is the process of digging through data to gain useful information to inform decision-making.

- Data mining is the process of exploring data for patterns, trends, or relationships, clustering data for segmentation, and modeling data for prediction or classification.

---
## Challenges in Data Mining

(0) Ethical or even legal issues

(1) Incomplete data

(2) Complex data

(3) Performance of model

(4) Scalability and Efficiency of algorithm

(5) Domain knowledge

(6) Data visualization

Dr. Zhu and Dr. Peng will mention complex data and scalability.

I will speak about model performance, and Dr. Xiaoyin Li will demonstrate examples of data visualization.

---
## Stages in Data Mining:

(1) Make sense of the goal to achieve.

(2) Collect and store data.

(3) Prepare data to be ready for analysis, including partition & normalization.

(4) Apply statistical or other tools and evaluate the model if any.

(5) Make a recommendation.

Our afternoon speaker Dr. Cheng Peng will talk about this from a business angle.

---
## Our First Speaker 
### Statistical Consulting & Research Center
by Dr. Ibrahim Soumare from SCSU

---
## A Review of R Programming Basics, Data Cleaning, and Data Visualization
by Dr. Xiaoyin Li from SCSU

---
## My Data Science Journey with St Cloud State University
by Mr. Simeon Paynter, a current SCSU student

---
## Forecasting with Time Series Data
by Dr. Mengshi Zhou from SCSU

---
## One-Hour Break
The Zoom may be ended and restarted @12:30 pm.

The afternoon session starts @12:30 pm

---
## Afternoon Session 
Starts @12:30 pm.

Ends @4:30 pm.

---
## Lessons Learned from Data-driven Projects
by Dr. Xinliang Zhu from Amazon

---
## An End-to-end Project-based Approach to Teaching Data Mining
### A Case Study in Credit Card Fraud Detection
by Dr. Cheng Peng from West Chester University

---
## Teaching Logistic Regression Models
### ROC Curves and Lift Charts for Evaluating Model Predictive Performance
by Dr. Shiju Zhang from SCSU

We start with a review of the ordinary multiple linear regression model.

---
## The Ordinary Multiple Linear Regression Model

Data:

- The response variable `$y$`, which is quantitative, and

- A set of explanatory variables `$x_1, x_2, \cdots, x_k$` (can be of any type)

Model:

`$$y = \beta_0 + \beta_1\cdot x_1 +\beta_2\cdot x_2 + \cdots +\beta_k\cdot x_k + \epsilon$$`

The model may contain some extra terms that are known functions of the `$x$`-terms. For example, when `$k = 3$`, the model may be

`\begin{aligned}
                  y = &\beta_0 + \beta_1\cdot x_1 +\beta_2\cdot x_2 +\beta_3\cdot x_3+\\
                  & \beta_4\cdot x_1^2 +\beta_5\cdot x_2^2 +\beta_6\cdot x_3^2+\\
                 & \beta_7\cdot x_1x_2 +\beta_8\cdot x_1x_3 +\beta_9x_2 x_3+\epsilon 
\end{aligned}`
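
A minimal sketch of fitting such a second-order model in R follows. The simulated data, the coefficient values, and the names `sim` and `fit_quad` are assumptions made only for illustration.

```r
# Simulated illustration; the data and coefficient values are assumptions.
set.seed(2022)
n_obs <- 200
sim <- data.frame(x1 = runif(n_obs), x2 = runif(n_obs), x3 = runif(n_obs))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + 0.5 * sim$x3 +
  1.5 * sim$x1^2 - 2 * sim$x1 * sim$x2 + rnorm(n_obs, sd = 0.3)

# (x1 + x2 + x3)^2 expands to the main effects plus all pairwise interactions;
# I() protects the squared terms from formula interpretation.
fit_quad <- lm(y ~ (x1 + x2 + x3)^2 + I(x1^2) + I(x2^2) + I(x3^2), data = sim)

# One intercept, 3 linear, 3 quadratic, and 3 interaction coefficients
length(coef(fit_quad))
```

The fitted object has ten coefficients, matching `$\beta_0$` through `$\beta_9$` in the model above.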

---
## Estimating the Coefficients

Under the assumption that the error term `$\epsilon$` is normally distributed, the general model can be written as

`$$Y\sim N(\beta_0 + \beta_1\cdot x_1 +\beta_2\cdot x_2 + \cdots +\beta_k\cdot x_k, \sigma^2)$$`
- When a set of `$n$` observations is available, each `$Y_i$` has a normal distribution whose mean depends on that observation's `$x$`-variables, with a common variance `$\sigma^2$`.

- Assuming the observations are independent, the joint density of `$Y_1, Y_2, \cdots, Y_n$` equals the product of the individual probability density functions.

- When treated as a function of the parameters (the `$\beta$`'s and `$\sigma$`), the joint density is called the likelihood of the observed `$y$` data.

- Maximizing this likelihood to find the optimal set of parameter values becomes a mathematics problem.
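
As a sketch of these points, assume the `$n$` observations are independent and write `$\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$` (the double subscript `$x_{ij}$` is notation introduced here for the `$j$`-th variable of observation `$i$`). The likelihood and its logarithm are then

`$$L(\beta, \sigma) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_i-\mu_i)^2}{2\sigma^2}\right)$$`

`$$\ln L(\beta, \sigma) = -\frac{n}{2}\ln(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu_i)^2$$`

For any fixed `$\sigma$`, maximizing over the `$\beta$`'s is the same as minimizing the sum of squared errors, so the maximum likelihood estimates of the coefficients coincide with the least squares estimates.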

---
## The Logistic Regression Model

If a random variable `$Y$` takes two possible values, 0 and 1, with `$p = P(Y=1)$`, then `$Y$` is said to have a *Bernoulli distribution* with parameter `$p$`.

When a response variable can be modeled by the Bernoulli distribution with `$p$` depending on a linear combination of a set of explanatory variables `$x_1, x_2, \cdots, x_k$` and/or their functions, the model is called the *logistic regression model*. The model in terms of `$p$` has the form:

`$$p(Y=1|x_1,x_2,\cdots,x_k)=\frac{1}{1+e^{-(\beta_0 + \beta_1\cdot x_1 +\beta_2\cdot x_2 + \cdots +\beta_k\cdot x_k)}}$$`
or,

`$$logit(p(Y=1|x_1,x_2,\cdots,x_k))=\beta_0 + \beta_1\cdot x_1 +\beta_2\cdot x_2 + \cdots +\beta_k\cdot x_k$$`

where `$logit(p)=ln(\frac{p}{1-p})$`.

When data are available, the likelihood can be easily written down. Maximizing this likelihood yields estimates of the coefficients.
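
The sketch below illustrates the Bernoulli likelihood on simulated data. The data and the names `y_sim`, `fit`, and `p_hat` are assumptions for illustration only. It checks that the log-likelihood written out by hand agrees with the value reported by `glm()`, and that the logit of the fitted probabilities recovers the linear predictor.

```r
set.seed(2022)
n_obs <- 500
x <- runif(n_obs)
p_true <- plogis(-1 + 2 * x)      # plogis() is the inverse logit
y_sim <- rbinom(n_obs, 1, p_true)

fit <- glm(y_sim ~ x, family = binomial)
p_hat <- fitted(fit)

# Bernoulli log-likelihood evaluated at the fitted probabilities
loglik_manual <- sum(y_sim * log(p_hat) + (1 - y_sim) * log(1 - p_hat))
loglik_glm <- as.numeric(logLik(fit))
all.equal(loglik_manual, loglik_glm)

# qlogis() is the logit, so it recovers the linear predictor
all.equal(unname(qlogis(p_hat)), unname(predict(fit, type = "link")))
```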

---
## Partition Data into Training Set and Validation Set
When fitting a model for prediction, the original data are split into two parts:

Original data = training data (say 60%) + validation data (the remaining 40%)

---
## Prediction with the Model on New Data

- A predictive model is trained with the training data.

- The model is applied to the validation data to obtain predicted probabilities (also known as propensity scores).

- Using an appropriate cutoff (say 0.5), the propensities can be converted to binary scores.

- Each observation in the validation data can then be cross-classified with the observed target and the predicted target to form a 2-way table, called a *confusion matrix*.
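
The steps above can be sketched in R as follows. The toy data set and the names `toy`, `fit_toy`, and `conf_mat` are assumptions for illustration, not the data used in this talk.

```r
set.seed(2022)
n_obs <- 1000
toy <- data.frame(x = runif(n_obs))
toy$y <- rbinom(n_obs, 1, plogis(-2 + 4 * toy$x))

# 60% training, 40% validation
in_train <- sample(n_obs) <= 0.6 * n_obs
train_toy <- toy[in_train, ]
valid_toy <- toy[!in_train, ]

fit_toy <- glm(y ~ x, data = train_toy, family = binomial)

# Propensities on the validation data, then binary scores at the cutoff
propensity <- predict(fit_toy, newdata = valid_toy, type = "response")
cutoff <- 0.5
predicted <- factor(as.integer(propensity >= cutoff), levels = c(1, 0))
reference <- factor(valid_toy$y, levels = c(1, 0))

# Rows are predictions, columns are the observed (reference) classes
conf_mat <- table(Predicted = predicted, Reference = reference)
conf_mat
```

Placing level 1 first in both factors puts the event of interest in the top-left cell of the table.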

---
## Confusion Matrix

<pre>
Suppose a 2x2 table with notation:

                 Reference
Predicted     Event   No Event
Event           A        B
No Event        C        D

Performance measures:

Sensitivity = A/(A+C), also called the Recall.

Specificity = D/(B+D)

Precision = A/(A+B), also called the Positive Predictive Value (PPV)

False Discovery Rate (FDR) = 1 - Precision

F1 = 2/(1/Recall + 1/Precision)

</pre>
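
As a small worked example of these measures, the sketch below uses hypothetical counts (A = 40, B = 10, C = 20, D = 30); the numbers are assumptions chosen only to make the arithmetic easy to follow.

```r
# Hypothetical counts laid out as on the slide: rows are predictions,
# columns are the reference (observed) classes.
A <- 40; B <- 10   # predicted Event:    A true positives,  B false positives
C <- 20; D <- 30   # predicted No Event: C false negatives, D true negatives

sensitivity <- A / (A + C)
specificity <- D / (B + D)
precision   <- A / (A + B)
fdr         <- 1 - precision
f1          <- 2 / (1 / sensitivity + 1 / precision)

round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision, FDR = fdr, F1 = f1), 3)
```

With these counts the sensitivity is 40/60, the specificity is 30/40, the precision is 40/50, and F1 is 8/11.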

---
## Playing with the Logistic Regression Model with Simulated Data

The true logistic regression model is

`\begin{aligned}
                  logit(p(Y=1|x_1,x_2,x_3,x_4,x_5,ct)) = & -1 + 1\cdot x_1 - 2\cdot x_2 + 3\cdot x_3 - 4\cdot x_4 + \\
                  & 5\cdot x_5 + 6\cdot ct - 7\cdot x_4\cdot ct 
\end{aligned}`

---
## Simulated Data

```r
# Simulate x variables
set.seed(123)

n = 2000 # Try also 5000
X = data.frame(x1 = runif(n),
               x2 = runif(n),
               x3 = runif(n),
               x4 = runif(n),
               x5 = runif(n),
               ct = rbinom(n, 1, 0.5)
              )
# Assume the true model is
p = 1/(1+exp(-(-1 + 1*X$x1 - 2*X$x2 + 3*X$x3 - 4*X$x4 + 5*X$x5 + 6*X$ct - 7*X$x4*X$ct)))

# Simulate the responses
y = sapply(1:n, function(k){rbinom(1, 1, p[k])})

D = cbind(X, y)
```

---
## Partition of Data

```r
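# Randomly assign about 75% of the rows to training, the rest to validation
index = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.75, 0.25))
train = D[index, ]
valid = D[!index, ]  # rows not used for training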
```

---
## Fit Logistic Model with Interactions between Predictors and the Group Variable

```r
m1 = glm(y~. + x1:ct + x2:ct + x3:ct + x4:ct + x5:ct, data = train, family = binomial)
```

---
### (First 10 rows of 2000 simulated observations)

```
##           x1         x2          x3         x4         x5 ct y
## 1  0.2875775 0.15967398 0.304464185 0.44026102 0.57791006  0 1
## 2  0.7883051 0.14451585 0.832818782 0.39739215 0.48884543  0 1
## 3  0.4089769 0.14918039 0.593647508 0.37154765 0.74219044  1 1
## 4  0.8830174 0.51443426 0.807196641 0.52880857 0.22560899  0 0
## 5  0.9404673 0.49282731 0.294050778 0.07378541 0.74889107  0 1
## 6  0.0455565 0.61634277 0.141085222 0.71684977 0.82486851  0 0
## 7  0.5281055 0.44742289 0.888621103 0.24281235 0.09954338  1 1
## 8  0.8924190 0.05567672 0.008285004 0.84449987 0.27587137  1 0
## 9  0.5514350 0.00539631 0.569120653 0.99531827 0.02137678  1 0
## 10 0.4566147 0.22183420 0.967550190 0.10501058 0.59624297  0 1
```

---
## Model with Interaction Terms between Explanatory Variables and Treatment

```
##                 Estimate Std. Error     z value     Pr(>|z|)
## (Intercept)  -0.86111284  0.4560675 -1.88812600 5.900904e-02
## x1            0.53920340  0.3984512  1.35324829 1.759763e-01
## x2           -2.05491341  0.4177177 -4.91938283 8.681752e-07
## x3            2.85132981  0.4617730  6.17474294 6.627121e-10
## x4           -3.58523134  0.4711803 -7.60904372 2.761313e-14
## x5            4.68326678  0.4975917  9.41186642 4.874039e-21
## ct            6.86346136  1.0534241  6.51538292 7.250446e-11
## x1:ct         0.67930082  0.7749404  0.87658462 3.807123e-01
## x2:ct         0.04723765  0.7604555  0.06211757 9.504692e-01
## x3:ct         0.78376832  0.8393018  0.93383366 3.503897e-01
## x4:ct       -10.51726832  1.6314670 -6.44650995 1.144551e-10
## x5:ct         2.32197735  1.0533203  2.20443626 2.749368e-02
```

---
## ROC Curve

![](data:image/png;base64,#DataMiningeCOTS2022_files/figure-html/unnamed-chunk-6-1.png)
Overfitting?
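
A ROC curve can be traced by sweeping the cutoff over the propensities and recording the false positive and true positive rates at each value. The sketch below does this in base R on six made-up scores; the helper names `roc_points` and `auc_rank` are hypothetical, and this is not necessarily how the plot on this slide was produced.

```r
# Hypothetical helper: ROC points from propensities and 0/1 outcomes
roc_points <- function(score, truth) {
  cutoffs <- sort(unique(score), decreasing = TRUE)
  data.frame(
    cutoff = cutoffs,
    fpr = sapply(cutoffs, function(cut) mean(score[truth == 0] >= cut)),
    tpr = sapply(cutoffs, function(cut) mean(score[truth == 1] >= cut))
  )
}

# Hypothetical helper: area under the curve via the rank-sum identity
auc_rank <- function(score, truth) {
  n1 <- sum(truth == 1)
  n0 <- sum(truth == 0)
  (sum(rank(score)[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)   # made-up propensities
truth <- c(1, 1, 0, 1, 0, 0)               # made-up observed outcomes

roc <- roc_points(score, truth)
roc
auc_rank(score, truth)   # 8 of the 9 event/non-event pairs are ranked correctly
```

Calling `plot(roc$fpr, roc$tpr, type = "s")` draws the curve from these points.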

---
## Fit the True Model

```
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -1.2767129  0.4057438 -3.146599 1.651814e-03
## x1           0.7346306  0.3394409  2.164237 3.044617e-02
## x2          -2.0319569  0.3460876 -5.871221 4.325972e-09
## x3           3.1273887  0.3831077  8.163210 3.262373e-16
## x4          -3.8402650  0.4732666 -8.114381 4.882685e-16
## x5           5.3022708  0.4373898 12.122531 8.024241e-34
## ct           7.2326042  0.8168986  8.853735 8.463801e-19
## x4:ct       -8.4742229  1.1349092 -7.466873 8.212314e-14
```

---
## ROC Curve Based on the True Model

![](data:image/png;base64,#DataMiningeCOTS2022_files/figure-html/unnamed-chunk-8-1.png)
Overfitting?

---
## Lift Charts

![](data:image/png;base64,#DataMiningeCOTS2022_files/figure-html/unnamed-chunk-9-1.png)

---
## Gains Table
The two lift charts are constructed from the following gains table.

```
## Depth                            Cume   Cume Pct                     Mean
##  of           Cume     Mean      Mean   of Total    Lift   Cume     Model
## File     N      N      Resp      Resp      Resp    Index   Lift     Score
## -------------------------------------------------------------------------
##   10   199    199      1.00      1.00      15.1%     151    151      1.00
##   20   200    399      1.00      1.00      30.2%     151    151      1.00
##   30   200    599      0.99      1.00      45.2%     150    151      0.98
##   40   200    799      0.93      0.98      59.3%     141    148      0.92
##   50   200    999      0.80      0.94      71.3%     120    143      0.84
##   60   200   1199      0.71      0.90      82.1%     107    137      0.71
##   70   200   1399      0.55      0.85      90.3%      82    129      0.54
##   80   200   1599      0.41      0.80      96.5%      62    121      0.35
##   90   200   1799      0.20      0.73      99.5%      30    111      0.17
##  100   200   1999      0.04      0.66     100.0%       5    100      0.04
```
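
The columns of a gains table like this one can be sketched in base R: sort the validation cases by model score, cut them into ten equal groups, and compare each group's response rate with the overall rate. The function name `gains_table` and the simulated scores are assumptions for illustration; the table on this slide was not necessarily produced this way.

```r
# Hypothetical helper: decile gains table from scores and 0/1 responses
gains_table <- function(score, resp, groups = 10) {
  ord <- order(score, decreasing = TRUE)
  score <- score[ord]
  resp <- resp[ord]
  n <- length(resp)
  grp <- ceiling(seq_len(n) * groups / n)      # group 1 holds the top scores
  n_grp <- as.vector(table(grp))
  resp_grp <- as.vector(tapply(resp, grp, sum))
  cume_n <- cumsum(n_grp)
  cume_resp <- cumsum(resp_grp)
  overall <- mean(resp)
  data.frame(
    depth          = 100 * cume_n / n,
    N              = n_grp,
    cume_n         = cume_n,
    mean_resp      = resp_grp / n_grp,
    cume_mean_resp = cume_resp / cume_n,
    cume_pct_total = 100 * cume_resp / sum(resp),
    lift           = 100 * (resp_grp / n_grp) / overall,
    cume_lift      = 100 * (cume_resp / cume_n) / overall,
    mean_score     = as.vector(tapply(score, grp, mean))
  )
}

set.seed(2022)
n_obs <- 1000
sim_score <- runif(n_obs)                 # made-up propensities
sim_resp <- rbinom(n_obs, 1, sim_score)   # responses consistent with them
gt <- gains_table(sim_score, sim_resp)
round(gt, 2)
```

A lift index of 151 in the first row of the slide's table means that the top 10% of cases respond at about 1.51 times the overall rate (1.00 versus 0.66).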

---
## Questions?

---
## Next Talk ->

---
## My Data Science Life
by Dr. Dan Liu from Johns Hopkins University

---
## Teaching Uplift Models for A/B Testing
by Dr. Shiju Zhang from SCSU

---
## Introduction

Uplift models are getting popular in marketing to determine whether a promotional offer should be sent to a customer. In political campaigns, they can be used to determine whether or not to send a registered voter a persuasion message.

The modeling method is used for the situation where subjects in one group receive a treatment of interest and subjects in the other group (control) receive no treatment. The groups are formed randomly. The procedure in industry is called *A/B testing*. The response/target variable is usually a binary variable (say buy or not).

A traditional model with a binary target variable, the treatment indicator, and a set of explanatory variables estimates only the overall group effect. It tells us whether the treatment is effective overall, but it does not estimate the effect of the treatment on an individual subject with a given profile.

The market is calling for personalized sales strategy. The medical community calls for personalized treatment. Universities want to target those high school graduates who are more likely to submit an application for admission in response to a direct mail. Uplift models may help these dreams come true.

---
## The Data Structure

```r
head(D)
```

```
##          x1        x2        x3         x4        x5 ct y
## 1 0.2875775 0.1596740 0.3044642 0.44026102 0.5779101  0 1
## 2 0.7883051 0.1445159 0.8328188 0.39739215 0.4888454  0 1
## 3 0.4089769 0.1491804 0.5936475 0.37154765 0.7421904  1 1
## 4 0.8830174 0.5144343 0.8071966 0.52880857 0.2256090  0 0
## 5 0.9404673 0.4928273 0.2940508 0.07378541 0.7488911  0 1
## 6 0.0455565 0.6163428 0.1410852 0.71684977 0.8248685  0 0
```
where `$ct = 1$` indicates treatment and `$ct = 0$` indicates control, and the `$x$`'s are profile variables.

---
## The Big Idea of Uplift Modeling

Given a subject whose profile is `$x$`, we want to estimate the difference in the probability of the target being "1" (the event of interest) between the condition that the subject is given the treatment and the condition that the subject is given the control:

`$$P(Y=1|x, treatment)-P(Y=1|x,control)$$`

There are at least three modeling approaches available in the literature. We focus on the one-model approach.

---
## The One-Model Approach

- Fit a predictive model (logistic regression, decision tree, k-NN, neural network, etc.) based on the training data with a treatment variable and other features as explanatory variables.
  
- Compute the predicted values based on the validation data.
  
- Reverse the value of the treatment variable in the validation data and re-compute the predicted values based on the validation data.
  
- Estimate the uplift for each individual by subtracting the two predictions for each subject (treatment minus control).

- Add the uplift values to the validation data as a new column, say uplift.

- Rank all individuals in the validation data in descending order by this new column.

- The user is ready to send persuasion messages to those subjects who have relatively larger uplift scores.
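
The steps above can be sketched with a logistic regression on simulated data. The data-generating model and the names `sim`, `fit_one`, and `ranked` are assumptions for illustration. Instead of flipping the treatment value and sorting out which prediction is which, the sketch scores every validation subject once as treated and once as control, which yields the same two predictions per subject.

```r
set.seed(2022)
n_obs <- 2000
sim <- data.frame(x1 = runif(n_obs), x4 = runif(n_obs),
                  ct = rbinom(n_obs, 1, 0.5))
sim$y <- rbinom(n_obs, 1, plogis(-1 + 1 * sim$x1 - 4 * sim$x4 +
                                   6 * sim$ct - 7 * sim$x4 * sim$ct))

# 60% training, 40% validation
in_train <- sample(n_obs) <= 0.6 * n_obs
train_sim <- sim[in_train, ]
valid_sim <- sim[!in_train, ]

# One model containing the treatment and its interaction with a feature
fit_one <- glm(y ~ x1 + x4 * ct, data = train_sim, family = binomial)

# Predict every validation subject twice: once as treated, once as control
as_treated <- transform(valid_sim, ct = 1)
as_control <- transform(valid_sim, ct = 0)
valid_sim$uplift <-
  predict(fit_one, newdata = as_treated, type = "response") -
  predict(fit_one, newdata = as_control, type = "response")

# Rank subjects from largest to smallest estimated uplift
ranked <- valid_sim[order(valid_sim$uplift, decreasing = TRUE), ]
head(ranked)
```

Without a treatment-by-feature interaction (or a model that can learn one, such as a tree), the estimated uplift would vary across subjects only through the nonlinearity of the link function.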

---
## Segmentation of Individuals

In general, subjects can be divided into 4 clusters: sure thing, persuadable, lost cause, and sleeping dog.

- Sure thing: those who always respond positively, treated or not

- Persuadable: those who respond positively only when treated

- Lost cause: those who always respond negatively, treated or not

- Sleeping dog: those who will not respond positively if treated, but would otherwise

---
## An Example Using Uplift Models

A campaign director conducts a survey of 10000 voters to determine their inclination to vote Democratic.

- First, the 10000 voters are split into two groups of 5000 each.

- A message promoting Smith is mailed to each individual in the first group (the treatment group, indicated by 1). No message is mailed to individuals in the second group (the control group, indicated by 0).

- The goal is to measure the change in opinion after the message is sent out, relative to the no-message control group.

A post-message survey of the same sample of 10000 voters is then conducted to measure whether each voter's opinion of Smith has shifted in a positive direction.

- A binary variable, Moved_AD, in the data indicates whether opinion has moved in a Democratic direction (1) or not (0).

---
## Data Dictionary

The variables that will be used from the dataset are:

* Age: Voter age in years

* NH_White: Neighborhood average of % non-Hispanic white in household

* Comm_PT: Neighborhood % of workers who take public transit

* H_F1: Single female household (1 = yes)

* Reg_Days: Days since voter registered at current address

* PR_Pelig: Voted in what % of non-presidential primaries

* E_Pelig: Voted in what % of any primaries

* Political_C: Is there a political contributor in the home? (1 = yes)
  
---
## A Glimpse of the Data:

```
##    MOVED_AD AGE NH_WHITE COMM_PT H_F1 REG_DAYS PR_PELIG E_PELIG POLITICALC
## 1         N  28       61       0    0     3997        0      20          1
## 2         N  53       87       1    0      258        0       0          0
## 3         Y  68       23      11    0     4217       33      50          1
## 4         N  66       53       2    0     3434        0      50          1
## 5         N  23       74       2    0     1215        0      50          0
## 6         Y  49       64       4    0    10899       67      80          1
## 7         Y  59       82       0    0      242        0       0          0
## 8         N  37       97       0    0     4136        0      70          1
## 9         Y  23       67       3    0      300        0       0          1
## 10        N  62       54       0    0    10879       33      60          0
##    MESSAGE_A
## 1          1
## 2          1
## 3          1
## 4          1
## 5          1
## 6          1
## 7          1
## 8          1
## 9          1
## 10         1
```

---
## A Summary of the Data

```
##   MESSAGE_A MOVED_AD
## 1         0   0.3444
## 2         1   0.4024
```
  
Of the 5000 voters (2988 + 2012) who received a message promoting Smith, 2012/5000 = 40.2% moved in a Democratic direction, compared with 34.4% of those who did not receive the message. Overall, the lift from the message (treatment) is `$40.2\%-34.4\%$`, or 5.8 percentage points.
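
The arithmetic can be checked in a few lines of R. The control count is derived here from the reported rate of 0.3444 and the group size of 5000; the variable names are assumptions.

```r
# Counts implied by the summary: 5000 voters per group, with response
# rates 0.4024 (message) and 0.3444 (no message).
n_group <- 5000
moved_treat <- 2012
moved_control <- 0.3444 * n_group   # about 1722 voters

rate_treat <- moved_treat / n_group
rate_control <- moved_control / n_group

# Overall lift of the message, in percentage points
overall_lift <- rate_treat - rate_control
round(100 * overall_lift, 1)
```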

---
## Partition of Data
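
A minimal sketch of the partition, assuming a 60/40 split, which is consistent with the 4000 ranked validation rows shown near the end of this talk. The data frame `voter.df` is a made-up stand-in for the real voter file, and the names `voter.df`, `train.index`, and `valid.df` are assumptions; only `train.df` and `MOVED_AD_NUM` appear in the model-fitting code.

```r
set.seed(1)

# Stand-in for the voter data: 10000 rows with a few of the columns
# shown in the glimpse of the data (values here are random placeholders).
voter.df <- data.frame(
  AGE = sample(18:90, 10000, replace = TRUE),
  MESSAGE_A = rep(c(1, 0), each = 5000),
  MOVED_AD = sample(c("Y", "N"), 10000, replace = TRUE)
)

# Numeric 0/1 version of the target, as the name MOVED_AD_NUM suggests
voter.df$MOVED_AD_NUM <- as.integer(voter.df$MOVED_AD == "Y")

# 60% of the rows for training, the remaining 40% for validation
train.index <- sample(nrow(voter.df), size = 0.6 * nrow(voter.df))
train.df <- voter.df[train.index, ]
valid.df <- voter.df[-train.index, ]

c(train = nrow(train.df), valid = nrow(valid.df))
```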

---
## R Code for Fitting the Uplift Model

```r
library(uplift)  # provides upliftRF() and upliftKNN()

# Use upliftRF() to fit an uplift random forest
# (alternatively, upliftKNN() applies k-nearest neighbors).
up.fit <- upliftRF(formula = MOVED_AD_NUM ~ AGE + NH_WHITE + COMM_PT +
                     H_F1 + REG_DAYS + PR_PELIG + E_PELIG + POLITICALC +
                     trt(MESSAGE_A),      # trt() marks the treatment variable
                   data = train.df,
                   mtry = 3,              # number of variables tested at each node;
                                          # the default is floor(sqrt(ncol(x)))
                   ntree = 20,            # number of trees; the default is 100
                   split_method = "KL",   # split criterion used at each node
                   minsplit = 20,         # minimum number of observations in a node
                                          # for a split to be attempted
                   verbose = TRUE)        # print status messages
```

---
## The Model Output

```
##       pr.y1_ct1 pr.y1_ct0
##  [1,]  0.291040  0.248800
##  [2,]  0.601270  0.448030
##  [3,]  0.369915  0.507075
##  [4,]  0.342565  0.265130
##  [5,]  0.319870  0.432210
##  [6,]  0.384585  0.437230
##  [7,]  0.430405  0.302535
##  [8,]  0.335275  0.344750
##  [9,]  0.279235  0.471150
## [10,]  0.541415  0.537565
## [11,]  0.460465  0.176285
## [12,]  0.457430  0.449760
## [13,]  0.379690  0.168515
## [14,]  0.306305  0.173840
## [15,]  0.336570  0.427255
## [16,]  0.433335  0.341815
## [17,]  0.418505  0.231195
## [18,]  0.646690  0.568020
## [19,]  0.454565  0.237660
## [20,]  0.290615  0.410370
```
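
Each row holds the predicted probability of the event under treatment (`pr.y1_ct1`) and under control (`pr.y1_ct0`); the estimated uplift is their difference. The sketch below recomputes it for the first five rows of this output. The names `pred` and `scored` are assumptions.

```r
# First five rows of the prediction matrix shown on this slide
pred <- cbind(
  pr.y1_ct1 = c(0.291040, 0.601270, 0.369915, 0.342565, 0.319870),
  pr.y1_ct0 = c(0.248800, 0.448030, 0.507075, 0.265130, 0.432210)
)

scored <- data.frame(row = seq_len(nrow(pred)), pred)

# Uplift = P(Y = 1 | treatment) - P(Y = 1 | control)
scored$uplift <- scored$pr.y1_ct1 - scored$pr.y1_ct0

# Largest estimated uplift first
scored[order(scored$uplift, decreasing = TRUE), ]
```

Among these five rows, row 2 has the largest estimated uplift (about 0.153) and row 3 the most negative (about -0.137).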

---
## The Qini Plot for Assessing the Performance of Uplift Models

```
## $Qini
## [1] 0.004835011
## 
## $inc.gains
##  [1] -0.001965500  0.009071502  0.027605257  0.030137508  0.039176510
##  [6]  0.058214515  0.062750766  0.060285515  0.066324517  0.068367017
## 
## $random.inc.gains
##  [1] 0.006836702 0.013673403 0.020510105 0.027346807 0.034183509 0.041020210
##  [7] 0.047856912 0.054693614 0.061530315 0.068367017
```

![](data:image/png;base64,#DataMiningeCOTS2022_files/figure-html/unnamed-chunk-17-1.png)
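
The reported `$Qini` value can be reproduced from the printed output as the trapezoidal area between the model's incremental gains curve and the random curve over the ten decile points. The sketch below is a check based only on the numbers printed above; the helper `trapezoid_area` and the name `qini_check` are assumptions, and the package's internal computation is not shown here.

```r
# Values copied from the printed output above
inc.gains <- c(-0.001965500, 0.009071502, 0.027605257, 0.030137508,
               0.039176510, 0.058214515, 0.062750766, 0.060285515,
               0.066324517, 0.068367017)
random.inc.gains <- c(0.006836702, 0.013673403, 0.020510105, 0.027346807,
                      0.034183509, 0.041020210, 0.047856912, 0.054693614,
                      0.061530315, 0.068367017)

# Proportion of the population targeted at each decile
depth <- seq(0.1, 1, by = 0.1)

# Hypothetical helper: area under a piecewise-linear curve
trapezoid_area <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

qini_check <- trapezoid_area(depth, inc.gains) -
  trapezoid_area(depth, random.inc.gains)
round(qini_check, 6)   # agrees with $Qini to six decimals
```

A positive value means that, over the deciles, targeting by predicted uplift gains more than targeting at random.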

---
## Ranked Individuals
The top 20: They could be persuadable.

```
##    voter_ID pr.y1_ct1 pr.y1_ct0   uplift
## 1      4315  0.556410  0.157785 0.398625
## 2      7006  0.596355  0.205465 0.390890
## 3      3524  0.589615  0.212450 0.377165
## 4      3505  0.598285  0.223145 0.375140
## 5       439  0.584915  0.216905 0.368010
## 6      7911  0.578975  0.214340 0.364635
## 7      4143  0.549075  0.187965 0.361110
## 8      4460  0.492885  0.135970 0.356915
## 9      5580  0.562140  0.211510 0.350630
## 10     1121  0.522980  0.173680 0.349300
## 11     5575  0.473190  0.124820 0.348370
## 12     3251  0.484735  0.137105 0.347630
## 13     9109  0.439470  0.097280 0.342190
## 14     9844  0.711330  0.369815 0.341515
## 15     2091  0.485675  0.144190 0.341485
## 16     4921  0.530065  0.189030 0.341035
## 17     9535  0.735525  0.396380 0.339145
## 18      247  0.476785  0.138330 0.338455
## 19     7197  0.596625  0.258470 0.338155
## 20     6062  0.543855  0.205775 0.338080
```

---
## Ranked Individuals
The bottom 20: They could be sleeping dogs.

```
##      voter_ID pr.y1_ct1 pr.y1_ct0    uplift
## 3981     2540  0.462545  0.773015 -0.310470
## 3982       40  0.237165  0.547760 -0.310595
## 3983     3153  0.254455  0.565780 -0.311325
## 3984     3206  0.248625  0.565245 -0.316620
## 3985     6079  0.187840  0.504465 -0.316625
## 3986     4253  0.209550  0.527640 -0.318090
## 3987     4140  0.223305  0.544325 -0.321020
## 3988     4348  0.440680  0.762640 -0.321960
## 3989     9568  0.403970  0.732230 -0.328260
## 3990     3276  0.205545  0.537550 -0.332005
## 3991     6913  0.322340  0.654945 -0.332605
## 3992     1788  0.390730  0.730455 -0.339725
## 3993     2039  0.201695  0.542640 -0.340945
## 3994     3093  0.328060  0.681625 -0.353565
## 3995     9958  0.310425  0.665260 -0.354835
## 3996     8907  0.345795  0.708095 -0.362300
## 3997     7011  0.316965  0.681175 -0.364210
## 3998     6500  0.261540  0.625915 -0.364375
## 3999     6838  0.179365  0.555965 -0.376600
## 4000      785  0.269440  0.649415 -0.379975
```

---
## Questions?

---
## The End

Thank you all!

Thank eCOTS for providing the broadcasting platform!