The underlying script draft is 1012_24.R, a complete script that contains all of the code.
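Import the raw database. It has already been cleaned, its missing values have been imputed, and it is stored as an R data file. The try function is used to read the database from different paths. paste seems to work when run on its own but fails when run inside a code chunk; this is still unresolved.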
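The path fallback described above could be sketched as follows. This is a minimal illustration, not the script's actual code: the helper read_first_available, the file name clean_data.rds, and the .rds format are all assumptions. It builds paths with file.path() rather than paste(), and the demo uses a temporary directory so that it runs anywhere.

```r
# Hypothetical helper: try each candidate directory in turn and return
# the first database that can be read.
read_first_available <- function(dirs, file) {
  for (d in dirs) {
    # file.path() inserts the separator itself, which avoids the
    # stray-space / missing-slash mistakes that paste() can introduce
    path <- file.path(d, file)
    # try() returns a "try-error" object instead of stopping the script
    result <- suppressWarnings(try(readRDS(path), silent = TRUE))
    if (!inherits(result, "try-error")) {
      message("Loaded: ", path)
      return(result)
    }
  }
  stop("Database not found in any candidate directory.")
}

# Demo with a temporary directory (assumed file name, toy content)
demo_dir <- tempdir()
saveRDS(data.frame(ID = 1:3), file.path(demo_dir, "clean_data.rds"))
df <- read_first_available(
  dirs = c(file.path(demo_dir, "missing_dir"), demo_dir),
  file = "clean_data.rds"
)
```

If a path works in the console but not inside a chunk, one possible cause worth checking is the working directory: knitr evaluates chunks relative to the document's directory by default, so relative paths may resolve differently.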
The preface that is needed when running without the script file is also provided here: it clears memory and starts a timer.
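A preface of this kind might look like the sketch below. It uses only base R so that it is dependency-free; the "sec elapsed" lines later in this report resemble tictoc output, so the original script may well use that package instead. The label "preface demo" and the Sys.sleep() call are placeholders.

```r
# Clear the workspace and release unused memory
rm(list = ls())
invisible(gc())

# Start the timer
t_start <- Sys.time()

# ... the analysis steps would run here ...
Sys.sleep(0.1)  # placeholder for real work

# Stop the timer and report the elapsed time in seconds
elapsed <- as.numeric(difftime(Sys.time(), t_start, units = "secs"))
cat(sprintf("preface demo: %.3f sec elapsed\n", elapsed))
```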
Data preparation.
Variable selection and subsample definition
Baseline descriptive statistics
Characteristic | N = 8,529¹ |
---|---|
Healthy aging index | 63.4 (14.2) |
Healthy lifestyle habits | |
0 | 99 (1.2%) |
1 | 918 (11%) |
2 | 2,427 (28%) |
3 | 2,418 (28%) |
4 | 1,910 (22%) |
5 | 692 (8.1%) |
6 | 65 (0.8%) |
Education level | |
Primary | 7,298 (86%) |
Intermediate | 1,045 (12%) |
Advanced | 186 (2.2%) |
Age | 56 (9) |
Log household per capita consumption | 8.59 (0.82) |
Urban/rural | |
Urban | 2,017 (24%) |
Rural | 6,512 (76%) |
Gender | |
Husband | 4,486 (53%) |
Wife | 4,043 (47%) |
Contact with children | |
No | 569 (6.7%) |
Yes | 7,960 (93%) |
Children live in the same city | |
No | 763 (8.9%) |
Yes | 7,766 (91%) |
Currently working | |
No | 2,504 (29%) |
Yes | 6,025 (71%) |
Survey wave | |
1 | 8,529 (100%) |
Marital status | |
Other | 1 (<0.1%) |
Married | 8,528 (100%) |
¹ Mean (SD); n (%) |
Name | (df_1) |
Number of rows | 28815 |
Number of columns | 12 |
Key | ID, wave |
_______________________ | |
Column type frequency: | |
numeric | 12 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
hai | 0 | 1 | 61.69 | 14.88 | 3.65 | 51.14 | 62.99 | 74.19 | 100.00 | ▁▂▆▇▁ |
hlif | 0 | 1 | 2.75 | 1.18 | 0.00 | 2.00 | 3.00 | 4.00 | 6.00 | ▅▇▇▆▂ |
edu | 0 | 1 | 1.18 | 0.44 | 1.00 | 1.00 | 1.00 | 1.00 | 3.00 | ▇▁▁▁▁ |
age | 0 | 1 | 58.06 | 8.73 | 10.00 | 51.00 | 57.00 | 64.00 | 101.00 | ▁▁▇▂▁ |
logpercons | 0 | 1 | 9.02 | 0.90 | 6.62 | 8.43 | 9.01 | 9.60 | 11.39 | ▁▅▇▅▁ |
rural | 0 | 1 | 1.75 | 0.44 | 1.00 | 1.00 | 2.00 | 2.00 | 2.00 | ▃▁▁▁▇ |
ragender | 0 | 1 | 1.48 | 0.50 | 1.00 | 1.00 | 1.00 | 2.00 | 2.00 | ▇▁▁▁▇ |
hkcnt | 0 | 1 | 1.92 | 0.27 | 1.00 | 2.00 | 2.00 | 2.00 | 2.00 | ▁▁▁▁▇ |
neark | 0 | 1 | 1.78 | 0.41 | 1.00 | 2.00 | 2.00 | 2.00 | 2.00 | ▂▁▁▁▇ |
rwork | 0 | 1 | 1.70 | 0.46 | 1.00 | 1.00 | 2.00 | 2.00 | 2.00 | ▃▁▁▁▇ |
wave | 0 | 1 | 2.49 | 1.17 | 1.00 | 1.00 | 3.00 | 4.00 | 4.00 | ▇▅▁▇▇ |
marry | 0 | 1 | 2.00 | 0.03 | 1.00 | 2.00 | 2.00 | 2.00 | 2.00 | ▁▁▁▁▇ |
Missing-value check visualization
## character(0)
Name | df_2 |
Number of rows | 22457 |
Number of columns | 13 |
Key | NULL |
_______________________ | |
Column type frequency: | |
numeric | 13 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
hlif | 0 | 1 | 0.00 | 1.00 | -2.36 | -0.67 | 0.18 | 1.03 | 2.73 | ▃▇▇▆▂ |
age | 0 | 1 | 61.15 | 7.19 | 51.00 | 55.00 | 60.00 | 66.00 | 89.00 | ▇▇▃▁▁ |
logpercons | 0 | 1 | 0.00 | 1.00 | -2.60 | -0.65 | 0.00 | 0.64 | 2.60 | ▁▅▇▅▁ |
wave | 0 | 1 | 2.59 | 1.17 | 1.00 | 1.00 | 3.00 | 4.00 | 4.00 | ▇▅▁▇▇ |
edu_2 | 0 | 1 | 0.13 | 0.34 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
edu_3 | 0 | 1 | 0.02 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
rural_2 | 0 | 1 | 0.73 | 0.44 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
ragender_2 | 0 | 1 | 0.45 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
hkcnt_2 | 0 | 1 | 0.91 | 0.28 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
neark_2 | 0 | 1 | 0.76 | 0.43 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▂▁▁▁▇ |
rwork_2 | 0 | 1 | 0.66 | 0.47 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▅▁▁▁▇ |
marry_2 | 0 | 1 | 1.00 | 0.03 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
hai | 0 | 1 | 0.00 | 1.00 | -3.78 | -0.71 | 0.07 | 0.81 | 2.64 | ▁▂▆▇▁ |
## [1] 13
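The df_2 summary above shows z-scored columns (hlif, logpercons, hai with mean 0 and SD 1), 0/1 indicator columns such as edu_2 and rural_2, and a minimum age of 51. One way such a frame might be derived is sketched below on toy data. The age cutoff is inferred from the output rather than stated in the source, and the toy values and object names (df_1_demo, df_sub, df_2_demo) are illustrative only.

```r
# Toy stand-in for df_1 (values are made up for illustration)
df_1_demo <- data.frame(
  hai   = c(40, 55, 62, 70, 81, 90, 48, 66),
  hlif  = c(1, 2, 3, 4, 2, 5, 0, 3),
  age   = c(49, 52, 55, 60, 63, 70, 75, 58),
  edu   = c(1, 1, 2, 1, 3, 2, 1, 1),
  rural = c(1, 2, 2, 1, 2, 2, 1, 2)
)

# Subsample: keep respondents aged 51+ (assumed cutoff)
df_sub <- df_1_demo[df_1_demo$age >= 51, ]

# z-score the continuous variables that appear standardized in df_2
for (v in c("hai", "hlif")) {
  df_sub[[v]] <- as.numeric(scale(df_sub[[v]]))
}

# Dummy-code the categorical variables, dropping the reference level
df_sub$edu   <- factor(df_sub$edu)
df_sub$rural <- factor(df_sub$rural)
dummies <- model.matrix(~ edu + rural, data = df_sub)[, -1, drop = FALSE]
# Rename "edu2" -> "edu_2" etc. to match the naming seen in df_2
colnames(dummies) <- sub("^(edu|rural)", "\\1_", colnames(dummies))

df_2_demo <- cbind(df_sub[, c("hlif", "age")], dummies, hai = df_sub$hai)
print(names(df_2_demo))
```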
## learning_rate max_depth bagging_fraction learner_param_vals x_domain
## <num> <int> <num> <list> <list>
## 1: 0.1271679 2 0.2608147 <list[6]> <list[3]>
## 1 variable(s) not shown: [regr.rmse <num>]
## [1] 0.25308353 -0.08498012 -0.09226821 -0.08469440 0.15989305 0.14784843
This is based on mlr3; lgb.train could also be called directly.
Ceteris-paribus (CP) profiles of the 6 most important variables for eight randomly sampled older adults
## DALEX section: 2.715 sec elapsed
This part works with the mlr3 object at. Multithreading seems to be slower than single-threaded execution (observed on a Mac)?
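Computing the interaction terms here is time-consuming, but they still add some explanatory value.
ALE of the 6 most important features
ICE trajectories for age
Interaction contributions of the features and a breakdown of the interactions with age
A decision tree as a surrogate for the black-box model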
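For readers unfamiliar with ICE curves, the computation can be sketched in base R: for each selected individual, vary age over a grid while holding every other feature fixed, then predict. This is an illustration under stated assumptions and not the report's iml call; the toy data and the lm() model standing in for the tuned LightGBM learner are hypothetical.

```r
set.seed(42)
n <- 200
# Toy data with an age-by-lifestyle interaction (made up for illustration)
toy <- data.frame(age = round(runif(n, 51, 89)), hlif = rnorm(n))
toy$hai <- 0.5 * toy$hlif - 0.03 * (toy$age - 60) +
  0.01 * toy$hlif * (toy$age - 60) + rnorm(n, sd = 0.5)

# Stand-in for the black-box model
fit <- lm(hai ~ hlif * age, data = toy)

age_grid <- seq(51, 89, by = 2)  # grid of age values to evaluate
ids <- sample(n, 8)              # eight randomly chosen individuals

# One ICE curve per individual: replicate the row, overwrite age, predict
ice <- sapply(ids, function(i) {
  obs <- toy[rep(i, length(age_grid)), ]
  obs$age <- age_grid
  predict(fit, newdata = obs)
})

# Averaging the curves gives a partial-dependence-style summary
# (here over the 8 sampled individuals only)
avg_curve <- rowMeans(ice)
```

The matrix ice has one row per grid point and one column per individual, so matplot(age_grid, ice, type = "l") would draw the trajectories. Non-parallel curves indicate that the effect of age depends on other features.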
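The idea of a surrogate tree is to fit a shallow, readable decision tree to the predictions of the black-box model (not to the observed outcome) and then check how much of the prediction variance the tree reproduces. The sketch below uses rpart as a swapped-in tree learner and lm() as a stand-in for the black box; the data and all object names are hypothetical and this is not the report's own code.

```r
library(rpart)

set.seed(7)
n <- 500
# Toy data (made up for illustration)
toy <- data.frame(
  age     = runif(n, 51, 89),
  hlif    = rnorm(n),
  rural_2 = rbinom(n, 1, 0.7)
)
toy$hai <- 0.6 * toy$hlif - 0.04 * (toy$age - 60) -
  0.3 * toy$rural_2 + rnorm(n, sd = 0.5)

# Stand-in for the tuned black-box model
black_box <- lm(hai ~ hlif * age + rural_2, data = toy)
toy$pred <- predict(black_box, newdata = toy)

# Surrogate: a depth-2 tree trained on the black-box predictions
surrogate <- rpart(
  pred ~ age + hlif + rural_2,
  data = toy,
  control = rpart.control(maxdepth = 2)
)

# Fidelity: share of prediction variance the surrogate reproduces
tree_pred <- predict(surrogate, newdata = toy)
r2 <- 1 - sum((toy$pred - tree_pred)^2) /
  sum((toy$pred - mean(toy$pred))^2)

print(surrogate)
cat(sprintf("Surrogate R^2 against black-box predictions: %.3f\n", r2))
```

A low fidelity value would suggest that the tree is too coarse to be trusted as an explanation of the black-box model.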
## iml section: 692.826 sec elapsed
This part also works with the mlr3 object at.
## Average loss
## [1] 0.8330394
## 1-way calculations...
## 2-way calculations...
## 'hstats' object. Use plot() or summary() for details.
##
## H^2 (normalized)
## [1] 0.03108107
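As a reading aid for the H^2 value, the total interaction strength is commonly defined, as best understood from the hstats documentation, as the share of prediction variability that the sum of main effects fails to explain. In the notation below, which is an assumption of this note and does not come from the report, F is the mean-centered prediction function, \hat F_j the mean-centered partial dependence function of feature j, and n the number of evaluation rows:

```latex
H^2 \;=\;
\frac{\dfrac{1}{n}\sum_{i=1}^{n}\Bigl[F(\mathbf{x}_i) - \sum_{j=1}^{p}\hat F_j(x_{ij})\Bigr]^2}
     {\dfrac{1}{n}\sum_{i=1}^{n}\bigl[F(\mathbf{x}_i)\bigr]^2}
```

Under this definition, a value of about 0.031 would mean that roughly 3% of the prediction variability is attributable to interaction effects of any order.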
## hstats section: 4.671 sec elapsed