3. Data Mining Steps

1. 问题描述及理解

2. 数据描述及理解

探索性数据分析
- 描述性分析
- 可视化分析

3. 数据准备

数据准备步骤是在数据理解的基础上对数据进行整理转换，以方便输入到建模算法中。具体任务如下：

数据的基本清洁
- 从数据格式上进行规范化，例如字符类型转换为数字类型，将不同单位的数值进行整理，将单位与数值分离，将若干不一致的缩写规范化等。
特征子集选择
- 在数据维度较高时，很可能有大量的特征和目标是无关或多余的，这样的高维数据会对模型结果造成影响。
- 运用业务知识排除无关变量; 通过各特征与目标变量之间的相关性来剔除多余变量
- 以主成分分析PCA为代表的特征抽取. prcomp() need to center and scale, preProcess() %>% predict()
- 正则化方法，即赋值给较无关的变量一个非常小的权重系数，而对有用的变量分配大的权重
数据转换
- 标准化 Centering and Scaling：将不同单位的变量转换为同一尺度，以避免基于距离的算法对大尺度的变量产生偏向. e1071::skewness() hist() lattice::histogram, caret::BoxCoxTrans() %>% predict(), MASS::boxcox() preProcess() %>% predict()…
- box-cox转换 Skewness Transformation：对于要服从正态分布的变量，需要做一些对数转换或幂转换才能达到要求. Scale(),caret::preProces() %>% predict()…
- 离散化：将连续数值进行分组归约
- 构造哑变量：将分类变量转为二元哑变量. caret::dummyVars() %>% predict()
数据构造
- 从现有特征衍生出新特征，多半是借助业务知识对变量进行线性组合或非线性组合，这样的变量可以更直接的暴露信息。
- 对缺失数据的删除和插补. impute::impute.knn(), preProcess() %>% predict

4. 数据建模

数据集分割 Data Split
- caret::createDataPartition()
模型调节(Model Tuning) – 模型的Tuning Parameter，避免Overfitting
数据重抽样方法(Data Split) – Bootstrap CV,估计出模型表现的平均结果
模型评估(Model Evaluation) – 模型表现好坏的准则，选择出Optional Tuning Parameter
模型选择(Model Selection) – 根据选出的最优调节参数，选择出Final MOdel

5. 模型评估

6. 模型部署

4. R Packages

Data Import/Export
- DBI
- RMysql
- XML
- Rcurl
Data Cleaning and Data Analysis
- plyr
- dplyr
- reshap
- reshap2
- data.table
- Hmisc
- Amelia
Data Visualization
- ggplot2
Data Analysis/Mining
- caret
- rpart
- glmnet
- gbm
- xgboost
- randomForest
- e1071
- pROC
- ROCR
Data Production
- shiny

5. Python Tools

6. Machine Learning Algorithms 实现

6.1 算法

6.2 Datasets

模拟生成的数据

模拟生成的数据产生自下面的模型 \begin{equation} \left\{ \begin{array}{ll} X \sim \mathcal{N}(0, I), \qquad Y|X=x\sim Bernoulli(p(x))\\ \log(\frac{p(x)}{1-p(x)}) = X^{T} \beta + \epsilon \end{array} \right. \end{equation} where the model’s \(\beta=(8, 1, 10, 3, 4, 6, 6, 4, 9, 4, 1, 7, 1, 9, 7, 3, 5, 4, 10, 4)\), \(I\) is a 10 dimensions identity matrix, the error term \(\epsilon\) follows standard normal distribution.

现实数据(German Credit Data)

iris data

PimaIndiansDiabetes2 Data Set(mlbench package)

6.2.1 Simulation Data Set

library(MASS)
I = diag(x = 10)
X = mvrnorm(n = 10000, mu = rep(0, 10), Sigma = I)
beta = c(8, 6, 7, 4, 3, 2, 1, 2, 6, 1)
epsilon = rnorm(n = 10000, mean = 0, sd = 1)
linear_y = X %*% beta + epsilon
p = 1 / (1 + exp(-linear_y))
y = rbinom(n = 10000, size = 1, prob = p)
X = as.data.frame(X)
names(X) = paste0(rep("x", 10), 1:10)
simulation = cbind(X, y)
head(simulation)

##           x1         x2         x3         x4          x5          x6
## 1 -1.6932736  1.6830843 -0.9623125  0.9361958  0.41037303  0.50358753
## 2 -1.4530986 -1.8877910 -0.8712916  1.5606637 -2.15229826  2.58383322
## 3 -1.0352852 -0.2828999  0.7361335 -0.1915065  0.80948616 -0.87006833
## 4  1.0964669 -1.1784492  0.7327553 -0.8175988 -0.51899757  0.05863843
## 5  1.2579615 -1.3201197 -1.0677687 -0.1280080  0.60316412  0.35847624
## 6  0.4700659  0.5466436  0.3385814  0.4856115 -0.03597478 -1.01484927
##           x7          x8         x9          x10 y
## 1  1.1937871 -0.47722195  1.1556607 -2.063981020 0
## 2  0.5857647  1.18473839 -0.8039231  0.002948212 0
## 3 -1.9221342  0.80276078 -1.4348059 -0.831977271 0
## 4 -0.1780196  0.50609122 -0.3736140 -2.162454528 0
## 5  1.0082632 -0.30488880  1.3697818  1.158582193 1
## 6 -0.2115944  0.06002148 -0.1596358 -0.244158547 1

str(simulation)

## 'data.frame':    10000 obs. of  11 variables:
##  $ x1 : num  -1.69 -1.45 -1.04 1.1 1.26 ...
##  $ x2 : num  1.683 -1.888 -0.283 -1.178 -1.32 ...
##  $ x3 : num  -0.962 -0.871 0.736 0.733 -1.068 ...
##  $ x4 : num  0.936 1.561 -0.192 -0.818 -0.128 ...
##  $ x5 : num  0.41 -2.152 0.809 -0.519 0.603 ...
##  $ x6 : num  0.5036 2.5838 -0.8701 0.0586 0.3585 ...
##  $ x7 : num  1.194 0.586 -1.922 -0.178 1.008 ...
##  $ x8 : num  -0.477 1.185 0.803 0.506 -0.305 ...
##  $ x9 : num  1.156 -0.804 -1.435 -0.374 1.37 ...
##  $ x10: num  -2.06398 0.00295 -0.83198 -2.16245 1.15858 ...
##  $ y  : int  0 0 0 0 1 1 1 0 0 1 ...

6.2.2 German Credit Data

credit = read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", 
                header = FALSE, 
                sep = "")
names(credit) = c(paste0(rep("x", 20), 1:20), "y")
credit$y = as.integer(credit$y) - 1
library(magrittr)
credit = sapply(credit, as.integer) %>%
    as.data.frame() %>%
    sapply(function(x) x/max(x)) %>%
    as.data.frame()
head(credit)

##     x1         x2  x3  x4         x5  x6  x7   x8   x9       x10  x11  x12
## 1 0.25 0.08333333 1.0 0.5 0.06344985 1.0 1.0 1.00 0.75 0.3333333 1.00 0.25
## 2 0.50 0.66666667 0.6 0.5 0.32300261 0.2 0.6 0.50 0.50 0.3333333 0.50 0.25
## 3 1.00 0.16666667 1.0 0.8 0.11376465 0.2 0.8 0.50 0.75 0.3333333 0.75 0.25
## 4 0.25 0.58333333 0.6 0.4 0.42781155 0.2 0.8 0.50 0.75 1.0000000 1.00 0.50
## 5 0.25 0.33333333 0.8 0.1 0.26432914 0.2 0.6 0.75 0.75 0.3333333 1.00 1.00
## 6 1.00 0.50000000 0.6 0.8 0.49147851 1.0 0.6 0.50 0.75 0.3333333 1.00 1.00
##         x13 x14       x15  x16  x17 x18 x19 x20 y
## 1 0.8933333   1 0.6666667 0.50 0.75 0.5 1.0 0.5 0
## 2 0.2933333   1 0.6666667 0.25 0.75 0.5 0.5 0.5 1
## 3 0.6533333   1 0.6666667 0.25 0.50 1.0 0.5 0.5 0
## 4 0.6000000   1 1.0000000 0.25 0.75 1.0 0.5 0.5 0
## 5 0.7066667   1 1.0000000 0.50 0.75 1.0 0.5 0.5 1
## 6 0.4666667   1 1.0000000 0.25 0.50 1.0 1.0 0.5 0

str(credit)

## 'data.frame':    1000 obs. of  21 variables:
##  $ x1 : num  0.25 0.5 1 0.25 0.25 1 1 0.5 1 0.5 ...
##  $ x2 : num  0.0833 0.6667 0.1667 0.5833 0.3333 ...
##  $ x3 : num  1 0.6 1 0.6 0.8 0.6 0.6 0.6 0.6 1 ...
##  $ x4 : num  0.5 0.5 0.8 0.4 0.1 0.8 0.4 0.2 0.5 0.1 ...
##  $ x5 : num  0.0634 0.323 0.1138 0.4278 0.2643 ...
##  $ x6 : num  1 0.2 0.2 0.2 0.2 1 0.6 0.2 0.8 0.2 ...
##  $ x7 : num  1 0.6 0.8 0.8 0.6 0.6 1 0.6 0.8 0.2 ...
##  $ x8 : num  1 0.5 0.5 0.5 0.75 0.5 0.75 0.5 0.5 1 ...
##  $ x9 : num  0.75 0.5 0.75 0.75 0.75 0.75 0.75 0.75 0.25 1 ...
##  $ x10: num  0.333 0.333 0.333 1 0.333 ...
##  $ x11: num  1 0.5 0.75 1 1 1 1 0.5 1 0.5 ...
##  $ x12: num  0.25 0.25 0.25 0.5 1 1 0.5 0.75 0.25 0.75 ...
##  $ x13: num  0.893 0.293 0.653 0.6 0.707 ...
##  $ x14: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ x15: num  0.667 0.667 0.667 1 1 ...
##  $ x16: num  0.5 0.25 0.25 0.25 0.5 0.25 0.25 0.25 0.25 0.5 ...
##  $ x17: num  0.75 0.75 0.5 0.75 0.75 0.5 0.75 1 0.5 1 ...
##  $ x18: num  0.5 0.5 1 1 1 1 0.5 0.5 0.5 0.5 ...
##  $ x19: num  1 0.5 0.5 0.5 0.5 1 0.5 1 0.5 0.5 ...
##  $ x20: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ y  : num  0 1 0 0 1 0 0 0 0 1 ...

6.2.3 iris Data Set

data(iris)
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

6.2.4 PimaIndiansDiabetes2 Data Set

set.seed(1)
data(PimaIndiansDiabetes2, package = "mlbench")
data <- PimaIndiansDiabetes2
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

# 标准化
preProcValue <- preProcess(data[, -9], method = c("center", "scale"))
scaleddata <- predict(preProcValue, data[, -9])
# YeoJohnson转换,是数据接近正态分布，并减弱异常值的影响
preProcbox <- preProcess(scaleddata, method = c("YeoJohnson"))
boxdata <- predict(preProcbox, scaleddata)
# 缺失值插补，使用装袋算法
library(ipred)
preProcimp <- preProcess(boxdata, method = "bagImpute")
procdata <- predict(preProcimp, boxdata)
procdata$class <- data[, 9]
head(procdata)

##     pregnant    glucose    pressure     triceps     insulin       mass
## 1  0.5284016  0.7595155 -0.03275471  0.52823166  0.31794703  0.1613449
## 2 -1.0902050 -1.4250227 -0.52420088 -0.01466832 -1.22004921 -0.9327976
## 3  0.8956985  1.5845945 -0.69028150 -1.01637841  0.02420507 -1.5205042
## 4 -1.0902050 -1.2509263 -0.52420088 -0.62258605 -0.67266546 -0.6791257
## 5 -1.5823833  0.4627602 -2.74133178  0.52823166  0.09906180  1.3218244
## 6  0.3065886 -0.1924995  0.12832774 -0.77660624 -0.41566226 -1.1067781
##     pedigree         age class
## 1  0.3807017  0.90154325   pos
## 2 -0.4347451 -0.20794297   neg
## 3  0.4674736 -0.11085497   pos
## 4 -1.3678689 -1.55542743   neg
## 5  1.7918753 -0.02068449   pos
## 6 -1.1705718 -0.31192824   neg

str(procdata)

## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant: num  0.528 -1.09 0.896 -1.09 -1.582 ...
##  $ glucose : num  0.76 -1.425 1.585 -1.251 0.463 ...
##  $ pressure: num  -0.0328 -0.5242 -0.6903 -0.5242 -2.7413 ...
##  $ triceps : num  0.5282 -0.0147 -1.0164 -0.6226 0.5282 ...
##  $ insulin : num  0.3179 -1.22 0.0242 -0.6727 0.0991 ...
##  $ mass    : num  0.161 -0.933 -1.521 -0.679 1.322 ...
##  $ pedigree: num  0.381 -0.435 0.467 -1.368 1.792 ...
##  $ age     : num  0.9015 -0.2079 -0.1109 -1.5554 -0.0207 ...
##  $ class   : Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

Machine Learning

zfwang

2016年5月20日

1. What is Machine Learning ?

2. Machine Learning Algorithms

有监督学习

无监督学习

降维方法

3. Data Mining Steps

4. R Packages

5. Python Tools

6. Machine Learning Algorithms 实现

6.1 算法

6.2 Datasets

6.2.1 Simulation Data Set

6.2.2 German Credit Data

6.2.3 iris Data Set

6.2.4 PimaIndiansDiabetes2 Data Set