Good Datasets

Finding data is always a chore. This document collects datasets that are useful and commonly used.

Collecting datasets from the web

Search for "lasso real data" -> find "prostate data (Stamey et al.)", then search for that -> find http://www.stat.wisc.edu/~gvludwig/fall_2012/handout15.R -> how to use the data: http://www.lisa.stat.vt.edu/sites/default/files/Model%20selection%20in%20R%20featuring%20the%20lasso.pdf

# The following dataset is from Hastie, Tibshirani and Friedman (2009), from a study 
# by Stamey et al. (1989) of prostate cancer, measuring the correlation between the level
# of a prostate-specific antigen and some covariates. The covariates are
#
# * lcavol  : log-cancer volume
# * lweight : log-prostate weight
# * age     : age of patient
# * lbph    : log-amount of benign prostatic hyperplasia
# * svi     : seminal vesicle invasion
# * lcp     : log-capsular penetration
# * gleason : Gleason Score, check http://en.wikipedia.org/wiki/Gleason_Grading_System
# * pgg45   : percent of Gleason scores 4 or 5
#
# And lpsa is the response variable, log-psa.

url <- "http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data"
str(pcancer <- read.table(url, header=TRUE))

## 'data.frame':    97 obs. of  10 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
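
The `train` column in the output above marks which rows belong to the training set. A minimal sketch of how that flag might be used: split the rows on it, then fit an ordinary least squares baseline on the training rows only (a stand-in for illustration, not the lasso itself). `split_by_train` and the `toy` data frame are hypothetical names made up here; `toy` exists only so the sketch runs without a network connection.

```r
# Split a data frame on its logical `train` column and drop that column.
# Assumes the layout shown by str() above: predictors, `lpsa`, then `train`.
split_by_train <- function(df) {
  stopifnot(is.logical(df$train))
  keep <- setdiff(names(df), "train")
  list(
    train = df[df$train, keep, drop = FALSE],
    test  = df[!df$train, keep, drop = FALSE]
  )
}

# Tiny made-up data frame standing in for pcancer (values are illustrative).
toy <- data.frame(
  lcavol = c(-0.58, -0.99, 0.75, 1.20, 2.10, 0.30),
  lpsa   = c(-0.43, -0.16, 0.37, 1.50, 2.80, 0.90),
  train  = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
)
parts <- split_by_train(toy)

# Least squares baseline fitted on the training rows only.
fit <- lm(lpsa ~ ., data = parts$train)

# Mean squared prediction error on the held-out rows.
pred <- predict(fit, newdata = parts$test)
test_mse <- mean((parts$test$lpsa - pred)^2)
```

With the real data, `split_by_train(pcancer)` should work the same way, since `pcancer` has the same `lpsa` and `train` columns.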

The software packages developed by Hastie ship with all kinds of useful data: http://web.stanford.edu/~hastie/swData.htm
- lars package http://www.stanford.edu/~hastie/Papers/LARS/
- glmnet package http://web.stanford.edu/~hastie/glmnet/glmnetData/
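
To see which datasets an installed package actually ships, something like the following could help. It is only a sketch: `list_pkg_datasets` is a hypothetical helper name, and it relies on `utils::data(package = ...)` returning a `results` matrix with an `Item` column. The base `datasets` package is used for the demonstration because it is always installed; `lars` is queried only if it happens to be available.

```r
# Return the names of the datasets bundled with an installed package,
# or an empty character vector if the package is not installed.
list_pkg_datasets <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(character(0))
  }
  res <- utils::data(package = pkg)$results
  unname(res[, "Item"])
}

# Always available: the datasets that come with base R.
base_items <- list_pkg_datasets("datasets")
print(head(base_items))

# Empty unless the lars package is installed locally.
lars_items <- list_pkg_datasets("lars")
```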
Search for "quantile regression real data"
- The labor pain data https://rdrr.io/rforge/Qtools/man/labor.html, in package Qtools or alternatively in lqmm. Neither package would install at first; lqmm only installed successfully from GitHub.
- The data in "Quantile regression for large-scale applications" comes from http://www.census.gov/census2000/PUMS5.html. It is unclear which subset was used, but the source itself is clearly stated.
- I could not make sense of the real data used in "Nonconvex penalized high-dimensional quantile regression with the SCAD".

library(lqmm)
data(labor)
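
Because the packages above were hard to install, it may be safer to load a packaged dataset defensively. The sketch below assumes nothing about lqmm beyond the `labor` dataset name used above; `load_pkg_data` is a hypothetical helper, and `mtcars` from the base `datasets` package is used only to show that it works when the package is present.

```r
# Load one dataset from a package into a private environment and return it.
# Returns NULL (with a message) when the package is not installed,
# and NULL when the package has no dataset of that name.
load_pkg_data <- function(name, pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("package '", pkg, "' is not installed")
    return(NULL)
  }
  env <- new.env()
  suppressWarnings(utils::data(list = name, package = pkg, envir = env))
  if (!exists(name, envir = env, inherits = FALSE)) {
    return(NULL)
  }
  get(name, envir = env)
}

# NULL unless lqmm is installed locally.
labor <- load_pkg_data("labor", "lqmm")

# Works on any R installation.
cars <- load_pkg_data("mtcars", "datasets")
```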

The data portal https://www.data.gov/ is quite good.

With the basics of finding data covered, it is time to get to work!

The Kaggle competition site is also really good: the datasets at https://www.kaggle.com/competitions are great too.
