class: center, middle, inverse, title-slide

.title[
# eCOTS 2022 e-Conference on Teaching Data Mining
]
.author[
### Hosted by St Cloud State University
]
.date[
### 05/14/2022
]

---
class: inverse

## Agenda (15 minutes)

1. Hello from hosts and goals of the conference (1 minute)
2. History of our data mining course (3 minutes, by Zhang)
3. Polling the audience (2 minutes, by Zhang)
4. Introducing our speakers (5 minutes, by Li)
5. Brief introduction to data mining (4 minutes, by Zhang)
6. First talk starts

---

## Goals of the Conference

1. To share ideas on teaching data mining to prepare the modern student in general
2. To share our practice in teaching data mining methods in particular
3. To hear what applied data scientists have to say about data mining methods
4. To throw out a brick in order to attract jade, that is, to offer our modest ideas in the hope of drawing out better ones

## History of Our Data Mining Course (3 minutes, by Shiju Zhang)

- Initially created by the *Information Systems* department. It was titled *Decision Support System* and first launched in 2004.
- The creator (Dr. Rich Sundheim) moved to the Statistics unit in 2012. He taught the same course under the name *Data Mining*.
- The enrollment was usually around 25, mostly statistics majors.
- Taught with SAS Enterprise Miner for many years, with JMP Pro for a few years, and recently with R.
- Topics we covered include the following:

<pre>
Data Partition, Data Cleaning, Data Normalization, Data Visualization,
Principal Component Analysis, Logistic Regression, Lift Chart,
Classification & Regression Trees, Ensembles (Random Forest, Boosted Tree),
Text Mining, Neural Network, K-Nearest Neighbors, Clustering Methods,
Uplift Modeling, Naive Bayes Classifier, Support Vector Machine
</pre>

---

## Polling the Audience (2 minutes)

Please click the "Polling" icon at the bottom of your Zoom screen. Then wait for the host to launch the poll. (by Zhang)

---

## Introducing Our Speakers (5 minutes, by Xiaoyin Li)

- Dr. Ibrahim Soumare
- Dr. Xiaoyin Li
- Mr. Simeon Paynter
- Dr. Mengshi Zhou
- Dr. Xinliang Zhu
- Dr. Cheng Peng
- Dr. Dan Liu
- Dr. Shiju Zhang

---

## What Is Data Mining

- Data mining is the process of turning data into information.
- Data mining is the process of digging through data to gain useful information to inform decision-making.
- Data mining is the process of exploring data for patterns, trends, or relationships, clustering data for segmentation, and modeling data for prediction or classification.

---

## Challenges in Data Mining

(0) Ethical or even legal issues

(1) Incomplete data

(2) Complex data

(3) Model performance

(4) Scalability and efficiency of algorithms

(5) Domain knowledge

(6) Data visualization

Dr. Zhu and Dr. Peng will mention complex data and scalability. I will speak about model performance, and Dr. Xiaoyin Li will demonstrate examples of data visualization.

---

## Stages in Data Mining

(1) Make sense of the goal to achieve.

(2) Collect and store data.

(3) Prepare the data for analysis, including partitioning and normalization.

(4) Apply statistical or other tools and evaluate the model, if any.

(5) Make a recommendation.

Our afternoon speaker Dr. Cheng Peng will talk about this from a business angle.

---

## Our First Speaker

### Statistical Consulting & Research Center

by Dr. Ibrahim Soumare from SCSU

---

## A Review of R Programming Basics, Data Cleaning, and Data Visualization

by Dr. Xiaoyin Li from SCSU

---

## My Data Science Journey with St Cloud State University

by Mr. Simeon Paynter, a current SCSU student

---

## Forecasting with Time Series Data

by Dr. Mengshi Zhou from SCSU

---

## One-Hour Break

The Zoom meeting may be ended and restarted at 12:30 pm.

The afternoon session starts at 12:30 pm.

---

## Afternoon Session

Starts at 12:30 pm. Ends at 4:30 pm.

---

## Lessons Learned from Data-driven Projects

by Dr. Xinliang Zhu from Amazon

---

## An End-to-end Project-based Approach to Teaching Data Mining

### A Case Study in Credit Card Fraud Detection

by Dr. Cheng Peng from West Chester University

---

## Teaching Logistic Regression Models

### ROC Curves and Lift Charts for Evaluating Model Predictive Performance

by Dr. Shiju Zhang from SCSU

<br>
<br>

We will start with a review of the ordinary multiple linear regression model.

---

## The Ordinary Multiple Linear Regression Model

Data:

- The response variable `\(y\)`, which is quantitative, and
- A set of explanatory variables `\(x_1, x_2, \cdots, x_k\)` (can be of any type)

Model:

`$$y = \beta_0 + \beta_1\cdot x_1 +\beta_2\cdot x_2 + \cdots +\beta_k\cdot x_k + \epsilon$$`

The model may contain some extra terms that are known functions of the `\(x\)`-terms. For example, when `\(k = 3\)`, the model may be

`\begin{aligned}
y = &\beta_0 + \beta_1\cdot x_1 +\beta_2\cdot x_2 +\beta_3\cdot x_3+\\
    & \beta_4\cdot x_1^2 +\beta_5\cdot x_2^2 +\beta_6\cdot x_3^2+\\
    & \beta_7\cdot x_1x_2 +\beta_8\cdot x_1x_3 +\beta_9\cdot x_2x_3+\epsilon
\end{aligned}`

---

## Estimating the Coefficients

Under the assumption that the error term `\(\epsilon\)` is normally distributed with mean 0 and variance `\(\sigma^2\)`, the general model can be written as

`$$Y\sim N(\beta_0 + \beta_1\cdot x_1 +\beta_2\cdot x_2 + \cdots +\beta_k\cdot x_k, \sigma^2)$$`

- When a set of `\(n\)` observations is available, each `\(Y_i\)` has a normal distribution with a mean depending on the observation's `\(x\)`-variables and a common variance `\(\sigma^2\)`.
- Assuming the observations are independent, the joint density of `\(Y_1, Y_2, \cdots, Y_n\)` equals the product of the probability density functions of the individual `\(Y_i\)`.
- When treated as a function of the parameters (the `\(\beta\)`'s and `\(\sigma\)`), the joint density is called the likelihood of the observed `\(y\)` data.
- Maximizing this likelihood to find the optimal set of parameter values becomes a mathematical optimization problem.

---

## The Logistic Regression Model

If a random variable `\(Y\)` takes two possible values, 0 and 1, with `\(p = P(Y=1)\)`, then `\(Y\)` is said to have a *Bernoulli distribution* with parameter `\(p\)`.
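
Equivalently, the Bernoulli probability mass function can be written in one line, a form that is convenient whenever a likelihood has to be written down:

`$$P(Y=y)=p^{y}(1-p)^{1-y},\qquad y\in\{0,1\}$$`

so that `\(E(Y)=p\)` and `\(Var(Y)=p(1-p)\)`.
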
When a response variable can be modeled by the Bernoulli distribution with `\(p\)` depending on a linear combination of a set of explanatory variables `\(x_1, x_2, \cdots, x_k\)` and/or their functions, the model is called the *logistic regression model*. The model in terms of `\(p\)` has the form:

`$$p(Y=1|x_1,x_2,\cdots,x_k)=\frac{1}{1+e^{-(\beta_0 + \beta_1\cdot x_1 +\beta_2\cdot x_2 + \cdots +\beta_k\cdot x_k)}}$$`

or,

`$$logit(p(Y=1|x_1,x_2,\cdots,x_k))=\beta_0 + \beta_1\cdot x_1 +\beta_2\cdot x_2 + \cdots +\beta_k\cdot x_k$$`

<br>
where `\(logit(p)=\ln\left(\frac{p}{1-p}\right)\)`.

When data are available, the likelihood can be easily written down. Maximizing this likelihood yields estimates of the coefficients.

---

## Partition Data into Training Set and Validation Set

When fitting a model for prediction,

Original data = training data (say 60%) + validation data (the rest)

---

## Prediction with the Model on New Data

- A predictive model is trained with the training data.
- The model is then applied to the validation data to obtain predicted probabilities (also known as propensities).
- Using an appropriate cutoff (say 0.5), the propensities can be converted to binary predicted classes.
- Each observation in the validation data can then be cross-classified by the observed target and the predicted target to form a 2-way table, called a *confusion matrix*.

---

## Confusion Matrix

<pre>
Suppose a 2x2 table with notation:

              Reference
Predicted     Event    No Event
Event           A         B
No Event        C         D

Performance measures:

Sensitivity = A/(A+C), also called the Recall.
Specificity = D/(B+D)
Precision = A/(A+B), also called the Positive Predictive Value (PPV)
False Discovery Rate (FDR) = 1 - Precision
F1 = 2/(1/Recall + 1/Precision)
</pre>

---

## Playing with the Logistic Regression Model with Simulated Data

The true logistic regression model is

`\begin{aligned}
logit(p(Y=1|x_1,x_2,x_3,x_4,x_5,ct)) = & -1 + 1\cdot x_1 - 2\cdot x_2 + 3\cdot x_3 - 4\cdot x_4 + \\
                                       & 5\cdot x_5 + 6\cdot ct - 7\cdot x_4\cdot ct
\end{aligned}`

---

## Simulated Data

```r
# Simulate x variables
set.seed(123)
n = 2000 # Try also 5000
X = data.frame(x1 = runif(n),
               x2 = runif(n),
               x3 = runif(n),
               x4 = runif(n),
               x5 = runif(n),
               ct = rbinom(n, 1, 0.5)   # group indicator: 1 = treatment, 0 = control
)

# Assume the true model is
p = 1/(1+exp(-(-1 + 1*X$x1 - 2*X$x2 + 3*X$x3 - 4*X$x4 + 5*X$x5 + 6*X$ct - 7*X$x4*X$ct)))

# Simulate the responses, one Bernoulli draw per observation
y = sapply(1:n, function(k){rbinom(1, 1, p[k])})
D = cbind(X, y)
```

---

## Partition of Data

```r
# Draw 75% of the row numbers at random for training;
# the remaining rows form the validation set
index = sample(n, 0.75*n)
train = D[index, ]
valid = D[-index, ]
```

---

## Fit Logistic Model with Interactions between Predictors and the Group Variable

```r
m1 = glm(y ~ . + x1:ct + x2:ct + x3:ct + x4:ct + x5:ct, data = train, family = binomial)
```

---

### (First 10 rows of 2000 simulated observations)

```
##           x1         x2          x3         x4         x5 ct y
## 1  0.2875775 0.15967398 0.304464185 0.44026102 0.57791006  0 1
## 2  0.7883051 0.14451585 0.832818782 0.39739215 0.48884543  0 1
## 3  0.4089769 0.14918039 0.593647508 0.37154765 0.74219044  1 1
## 4  0.8830174 0.51443426 0.807196641 0.52880857 0.22560899  0 0
## 5  0.9404673 0.49282731 0.294050778 0.07378541 0.74889107  0 1
## 6  0.0455565 0.61634277 0.141085222 0.71684977 0.82486851  0 0
## 7  0.5281055 0.44742289 0.888621103 0.24281235 0.09954338  1 1
## 8  0.8924190 0.05567672 0.008285004 0.84449987 0.27587137  1 0
## 9  0.5514350 0.00539631 0.569120653 0.99531827 0.02137678  1 0
## 10 0.4566147 0.22183420 0.967550190 0.10501058 0.59624297  0 1
```

---

## Model with Interaction Terms between Explanatory Variables and Treatment

```
##                 Estimate Std. Error     z value     Pr(>|z|)
## (Intercept)  -0.86111284  0.4560675 -1.88812600 5.900904e-02
## x1            0.53920340  0.3984512  1.35324829 1.759763e-01
## x2           -2.05491341  0.4177177 -4.91938283 8.681752e-07
## x3            2.85132981  0.4617730  6.17474294 6.627121e-10
## x4           -3.58523134  0.4711803 -7.60904372 2.761313e-14
## x5            4.68326678  0.4975917  9.41186642 4.874039e-21
## ct            6.86346136  1.0534241  6.51538292 7.250446e-11
## x1:ct         0.67930082  0.7749404  0.87658462 3.807123e-01
## x2:ct         0.04723765  0.7604555  0.06211757 9.504692e-01
## x3:ct         0.78376832  0.8393018  0.93383366 3.503897e-01
## x4:ct       -10.51726832  1.6314670 -6.44650995 1.144551e-10
## x5:ct         2.32197735  1.0533203  2.20443626 2.749368e-02
```

---

## ROC Curve

Overfitting?

---

## Fit the True Model

```
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -1.2767129  0.4057438 -3.146599 1.651814e-03
## x1           0.7346306  0.3394409  2.164237 3.044617e-02
## x2          -2.0319569  0.3460876 -5.871221 4.325972e-09
## x3           3.1273887  0.3831077  8.163210 3.262373e-16
## x4          -3.8402650  0.4732666 -8.114381 4.882685e-16
## x5           5.3022708  0.4373898 12.122531 8.024241e-34
## ct           7.2326042  0.8168986  8.853735 8.463801e-19
## x4:ct       -8.4742229  1.1349092 -7.466873 8.212314e-14
```

---

## ROC Curve Based on the True Model

Overfitting?

---

## Lift Charts

---

## Gains Table

Here is the table from which the two lift charts are constructed.

```
## Depth                            Cume   Cume Pct                     Mean
##  of           Cume     Mean      Mean   of Total    Lift   Cume     Model
## File     N      N      Resp      Resp      Resp    Index   Lift     Score
## -------------------------------------------------------------------------
##   10   199    199      1.00      1.00     15.1%     151    151      1.00
##   20   200    399      1.00      1.00     30.2%     151    151      1.00
##   30   200    599      0.99      1.00     45.2%     150    151      0.98
##   40   200    799      0.93      0.98     59.3%     141    148      0.92
##   50   200    999      0.80      0.94     71.3%     120    143      0.84
##   60   200   1199      0.71      0.90     82.1%     107    137      0.71
##   70   200   1399      0.55      0.85     90.3%      82    129      0.54
##   80   200   1599      0.41      0.80     96.5%      62    121      0.35
##   90   200   1799      0.20      0.73     99.5%      30    111      0.17
##  100   200   1999      0.04      0.66    100.0%       5    100      0.04
```

---

## Questions?

---

## Next Talk ->

---

## My Data Science Life

by Dr. Dan Liu from Johns Hopkins University

---

## Teaching Uplift Models for A/B Testing

by Dr. Shiju Zhang from SCSU

---

## Introduction

Uplift models are getting popular in marketing to determine whether a promotional offer should be sent to a customer. In political campaigns, they can be used to determine whether or not to send a registered voter a persuasion message.

The modeling method is used in situations where subjects in one group receive a treatment of interest and subjects in the other group (control) receive no treatment. The groups are formed randomly. In industry, this procedure is called *A/B testing*.
The response/target variable is usually a binary variable (say, buy or not). A traditional model with a binary target variable, the treatment indicator, and a set of explanatory variables only allows the group effect to be estimated. It tells us whether the treatment is effective overall, but it cannot estimate the effect of the treatment on an individual subject with a given profile.

The market is calling for personalized sales strategies. The medical community calls for personalized treatment. Universities want to target those high school graduates who are more likely to submit an application for admission in response to direct mail. Uplift models may help make these dreams come true.

---

## The Data Structure

```r
head(D)
```

```
##          x1        x2        x3         x4        x5 ct y
## 1 0.2875775 0.1596740 0.3044642 0.44026102 0.5779101  0 1
## 2 0.7883051 0.1445159 0.8328188 0.39739215 0.4888454  0 1
## 3 0.4089769 0.1491804 0.5936475 0.37154765 0.7421904  1 1
## 4 0.8830174 0.5144343 0.8071966 0.52880857 0.2256090  0 0
## 5 0.9404673 0.4928273 0.2940508 0.07378541 0.7488911  0 1
## 6 0.0455565 0.6163428 0.1410852 0.71684977 0.8248685  0 0
```

where `\(ct = 1\)` (treatment) or 0 (control) and the `\(x\)`'s are profile variables.

---

## The Big Idea of Uplift Modeling

Given a subject whose profile is `\(x\)`, we want to estimate <strong>the difference in the probability of the target being "1" (the event of interest) between the condition that the subject is given the treatment and the condition that the subject is given the control</strong>:

`$$P(Y=1|x, treatment)-P(Y=1|x,control)$$`

There are at least three modeling approaches available in the literature. We focus on the one-model approach.

---

## The One-Model Approach

- Fit a predictive model (logistic regression, decision tree, k-NN, neural network, etc.) on the training data with the treatment variable and other features as explanatory variables.
- Compute the predicted values based on the validation data.
- Reverse the value of the treatment variable in the validation data and re-compute the predicted values.
- Estimate the uplift for each individual by subtracting the two predictions for each subject (treatment minus control).
- Add the uplift values to the validation data as a new column, say uplift.
- Rank all individuals in the validation data in descending order by this new column.
- The user is ready to send persuasion messages to those subjects who have relatively large uplift scores.

---

## Segmentation of Individuals

In general, subjects can be divided into 4 clusters: sure thing, persuadable, lost cause, and sleeping dog.

<span style = "color:blue">sure thing</span>: those who always respond positively

<span style = "color:green">persuadable</span>: those who only respond when treated

<span style = "color:pink">lost cause</span>: those who always respond negatively

<span style = "color:red">sleeping dog</span>: those who respond positively only if left untreated

---

## An Example Using Uplift Models

A campaign director conducts a survey of 10000 voters to determine their inclination to vote Democratic.

- First, the 10000 voters are split into two groups of 5000 each.
- A message promoting Smith is mailed to each individual in the first group (the treatment group, indicated by 1). No message is mailed to individuals in the second group (the control group, indicated by 0).
- The goal is to measure the change in opinion after the message is sent out, relative to the no-message control group. A post-message survey of the same sample of 10000 voters is then conducted to measure whether each voter's opinion of Smith has shifted in a positive direction.
- A binary variable, Moved_AD, in the data indicates whether opinion has moved in a Democratic direction (1) or not (0).
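
---

## A Sketch of the One-Model Approach in R

Before turning to the voter data, the steps of the one-model approach listed earlier can be tried on a toy example. The code below is only a minimal sketch under stated assumptions, not the code behind this talk: it assumes a simulated data set with a single profile variable `x`, a treatment indicator `ct`, and a logistic regression with an `x:ct` interaction as the predictive model. The object names (`dat`, `fit`, `pred_obs`, `pred_rev`, `ranked`) and the coefficients of the simulated model are made up for illustration.

```r
# Toy data (assumed): one profile variable x and a randomized treatment ct
set.seed(1)
n   <- 2000
x   <- runif(n)
ct  <- rbinom(n, 1, 0.5)                      # 1 = treatment, 0 = control
p   <- plogis(-1 + 2 * ct + x - 4 * x * ct)   # hypothetical true model
y   <- rbinom(n, 1, p)
dat <- data.frame(x, ct, y)

# Partition: 60% training, 40% validation
idx   <- sample(n, size = round(0.6 * n))
train <- dat[idx, ]
valid <- dat[-idx, ]

# Fit one predictive model that includes the treatment variable
fit <- glm(y ~ x * ct, data = train, family = binomial)

# Predict on the validation data as observed
pred_obs <- predict(fit, newdata = valid, type = "response")

# Reverse the treatment variable and predict again
pred_rev <- predict(fit, newdata = transform(valid, ct = 1 - ct),
                    type = "response")

# Uplift = prediction under treatment minus prediction under control
valid$uplift <- ifelse(valid$ct == 1, pred_obs - pred_rev, pred_rev - pred_obs)

# Rank individuals in descending order of uplift
ranked <- valid[order(valid$uplift, decreasing = TRUE), ]
head(ranked)
```

Subjects near the top of `ranked` have the largest estimated uplift and would be the first candidates for the treatment; those near the bottom have the smallest, possibly negative, estimated uplift.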

---

## Data Dictionary

The variables that will be used from the dataset are:

* Age: Voter age in years
* NH_White: Neighborhood average of % non-Hispanic white in household
* Comm_PT: Neighborhood % of workers who take public transit
* H_F1: Single female household (1 = yes)
* Reg_Days: Days since voter registered at current address
* PR_Pelig: Voted in what % of non-presidential primaries
* E_Pelig: Voted in what % of any primaries
* Political_C: Is there a political contributor in the home? (1 = yes)

---

## A Glimpse of the Data

```
##    MOVED_AD AGE NH_WHITE COMM_PT H_F1 REG_DAYS PR_PELIG E_PELIG POLITICALC
## 1         N  28       61       0    0     3997        0      20          1
## 2         N  53       87       1    0      258        0       0          0
## 3         Y  68       23      11    0     4217       33      50          1
## 4         N  66       53       2    0     3434        0      50          1
## 5         N  23       74       2    0     1215        0      50          0
## 6         Y  49       64       4    0    10899       67      80          1
## 7         Y  59       82       0    0      242        0       0          0
## 8         N  37       97       0    0     4136        0      70          1
## 9         Y  23       67       3    0      300        0       0          1
## 10        N  62       54       0    0    10879       33      60          0
##    MESSAGE_A
## 1          1
## 2          1
## 3          1
## 4          1
## 5          1
## 6          1
## 7          1
## 8          1
## 9          1
## 10         1
```

---

## A Summary of the Data

```
##   MESSAGE_A MOVED_AD
## 1         0   0.3444
## 2         1   0.4024
```

Of the voters (2988 + 2012 = 5000) who received a message promoting Smith, 2012/5000 = 40.2% moved in a Democratic direction, while this number was 34.4% for those who did not receive such a message. Overall, the lift from the message (treatment) is `\(40.2\%-34.4\%\)`, or 5.8 percentage points.

---

## Partition of Data

---

## R Code for Fitting the Uplift Model

```r
library(uplift)  # provides upliftRF(), upliftKNN(), and trt()

# train.df is the training partition; MOVED_AD_NUM is assumed to be
# a numeric 0/1 version of MOVED_AD.
# Use upliftRF() to apply a random forest (alternatively use upliftKNN() to apply kNN).
up.fit <- upliftRF(formula = MOVED_AD_NUM ~ AGE + NH_WHITE + COMM_PT + H_F1 + REG_DAYS +
                     PR_PELIG + E_PELIG + POLITICALC +
                     trt(MESSAGE_A),     # note the use of trt() to mark the treatment variable
                   data = train.df,
                   mtry = 3,             # the number of variables tested at each node;
                                         # the default is floor(sqrt(ncol(x)))
                   ntree = 20,           # the number of trees used; the default is ntree = 100
                   split_method = "KL",  # the split criterion used at each node of each tree
                   minsplit = 20         # the minimum number of observations that must exist in
                                         # a node in order for a split to be attempted
                   # the optional verbose argument controls whether status messages are printed
)
```

---

## The Model Output

```
##       pr.y1_ct1 pr.y1_ct0
##  [1,]  0.291040  0.248800
##  [2,]  0.601270  0.448030
##  [3,]  0.369915  0.507075
##  [4,]  0.342565  0.265130
##  [5,]  0.319870  0.432210
##  [6,]  0.384585  0.437230
##  [7,]  0.430405  0.302535
##  [8,]  0.335275  0.344750
##  [9,]  0.279235  0.471150
## [10,]  0.541415  0.537565
## [11,]  0.460465  0.176285
## [12,]  0.457430  0.449760
## [13,]  0.379690  0.168515
## [14,]  0.306305  0.173840
## [15,]  0.336570  0.427255
## [16,]  0.433335  0.341815
## [17,]  0.418505  0.231195
## [18,]  0.646690  0.568020
## [19,]  0.454565  0.237660
## [20,]  0.290615  0.410370
```

---

## The Qini Plot for Assessing the Performance of Uplift Models

```
## $Qini
## [1] 0.004835011
##
## $inc.gains
##  [1] -0.001965500  0.009071502  0.027605257  0.030137508  0.039176510
##  [6]  0.058214515  0.062750766  0.060285515  0.066324517  0.068367017
##
## $random.inc.gains
##  [1] 0.006836702 0.013673403 0.020510105 0.027346807 0.034183509 0.041020210
##  [7] 0.047856912 0.054693614 0.061530315 0.068367017
```

---

## Ranked Individuals

The top 20: They could be persuadable.

```
##    voter_ID pr.y1_ct1 pr.y1_ct0   uplift
## 1      4315  0.556410  0.157785 0.398625
## 2      7006  0.596355  0.205465 0.390890
## 3      3524  0.589615  0.212450 0.377165
## 4      3505  0.598285  0.223145 0.375140
## 5       439  0.584915  0.216905 0.368010
## 6      7911  0.578975  0.214340 0.364635
## 7      4143  0.549075  0.187965 0.361110
## 8      4460  0.492885  0.135970 0.356915
## 9      5580  0.562140  0.211510 0.350630
## 10     1121  0.522980  0.173680 0.349300
## 11     5575  0.473190  0.124820 0.348370
## 12     3251  0.484735  0.137105 0.347630
## 13     9109  0.439470  0.097280 0.342190
## 14     9844  0.711330  0.369815 0.341515
## 15     2091  0.485675  0.144190 0.341485
## 16     4921  0.530065  0.189030 0.341035
## 17     9535  0.735525  0.396380 0.339145
## 18      247  0.476785  0.138330 0.338455
## 19     7197  0.596625  0.258470 0.338155
## 20     6062  0.543855  0.205775 0.338080
```

---

## Ranked Individuals

The bottom 20: They could be sleeping dogs.

```
##      voter_ID pr.y1_ct1 pr.y1_ct0    uplift
## 3981     2540  0.462545  0.773015 -0.310470
## 3982       40  0.237165  0.547760 -0.310595
## 3983     3153  0.254455  0.565780 -0.311325
## 3984     3206  0.248625  0.565245 -0.316620
## 3985     6079  0.187840  0.504465 -0.316625
## 3986     4253  0.209550  0.527640 -0.318090
## 3987     4140  0.223305  0.544325 -0.321020
## 3988     4348  0.440680  0.762640 -0.321960
## 3989     9568  0.403970  0.732230 -0.328260
## 3990     3276  0.205545  0.537550 -0.332005
## 3991     6913  0.322340  0.654945 -0.332605
## 3992     1788  0.390730  0.730455 -0.339725
## 3993     2039  0.201695  0.542640 -0.340945
## 3994     3093  0.328060  0.681625 -0.353565
## 3995     9958  0.310425  0.665260 -0.354835
## 3996     8907  0.345795  0.708095 -0.362300
## 3997     7011  0.316965  0.681175 -0.364210
## 3998     6500  0.261540  0.625915 -0.364375
## 3999     6838  0.179365  0.555965 -0.376600
## 4000      785  0.269440  0.649415 -0.379975
```

---

## Questions?

---

## The End

Thank you all!

Thanks to eCOTS for providing the broadcasting platform!