Finding data is always a chore, so this note collects datasets that are handy and commonly used.

Datasets bundled with R packages

  1. The datasets package ships with a collection of datasets. Usage:
library(datasets)
data(swiss)
# swiss is a nice, easy-to-use dataset
  1. Many other R packages bundle their own datasets; list them with data(package = "package name")
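The two lookups above can be combined into a small sketch. It relies only on base R; `list_datasets` is a hypothetical helper name introduced here for illustration, not something from any package.

```r
# List every dataset that a package ships with. data() returns an object whose
# $results matrix has the columns "Package", "LibPath", "Item" and "Title".
# `list_datasets` is a hypothetical helper defined only for this note.
list_datasets <- function(pkg) {
  info <- data(package = pkg)
  as.data.frame(info$results[, c("Item", "Title"), drop = FALSE],
                stringsAsFactors = FALSE)
}

catalog <- list_datasets("datasets")
head(catalog)

# Load one dataset by name and take a first look at its structure.
data(swiss, package = "datasets")
str(swiss)
```

Replacing "datasets" with the name of any installed package gives that package's catalog in the same form.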

Datasets collected from the web

  1. Search for "lasso real data" -> find "prostate data (Stamey et al.)", then search for that -> find http://www.stat.wisc.edu/~gvludwig/fall_2012/handout15.R -> how to use the data: http://www.lisa.stat.vt.edu/sites/default/files/Model%20selection%20in%20R%20featuring%20the%20lasso.pdf
# The following dataset is from Hastie, Tibshirani and Friedman (2009), from a study 
# by Stamey et al. (1989) of prostate cancer, measuring the correlation between the level
# of a prostate-specific antigen and some covariates. The covariates are
#
# * lcavol  : log-cancer volume
# * lweight : log-prostate weight
# * age     : age of patient
# * lbph    : log-amount of benign prostatic hyperplasia
# * svi     : seminal vesicle invasion
# * lcp     : log-capsular penetration
# * gleason : Gleason Score, check http://en.wikipedia.org/wiki/Gleason_Grading_System
# * pgg45   : percent of Gleason scores 4 or 5
#
# And lpsa is the response variable, log-psa.

url <- "http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data"
str(pcancer <- read.table(url, header=TRUE))
## 'data.frame':    97 obs. of  10 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
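The `train` column in the output above marks which rows belong to the training set. A minimal sketch of how it might be used follows, assuming a data frame shaped like `pcancer`. Both function names are hypothetical, and the ordinary least-squares fit is only a baseline for comparison, not the lasso analysis from the linked handouts.

```r
# Split a data frame on its logical `train` column (as in prostate.data)
# and drop that column from both halves.
# `split_by_train` and `ols_test_mse` are hypothetical helper names.
split_by_train <- function(df) {
  stopifnot(is.logical(df$train))
  keep <- setdiff(names(df), "train")
  list(train = df[df$train, keep, drop = FALSE],
       test  = df[!df$train, keep, drop = FALSE])
}

# Fit an ordinary least-squares baseline on the training half and
# return the mean squared prediction error on the test half.
ols_test_mse <- function(df, response = "lpsa") {
  parts <- split_by_train(df)
  fit <- lm(as.formula(paste(response, "~ .")), data = parts$train)
  pred <- predict(fit, newdata = parts$test)
  mean((parts$test[[response]] - pred)^2)
}

# Usage, assuming `pcancer` was read with read.table() as in the snippet above:
# ols_test_mse(pcancer)
```

Keeping the split in its own function makes it easy to swap the `lm` call for another model later while scoring it on the same held-out rows.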
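  1. The software packages developed by Hastie come with all sorts of useful data: http://web.stanford.edu/~hastie/swData.htm
     - lars package: http://www.stanford.edu/~hastie/Papers/LARS/
  2. Search for "quantile regression real data"; one result is the labor dataset in the lqmm package: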
library(lqmm)
data(labor)
  1. The data portal https://www.data.gov/ is quite good.

With the basics of finding data covered, it is time to get to work!

  1. The Kaggle competition site is also really good: https://www.kaggle.com/competitions. The datasets there are great too.