I. R: Basic Introduction

1.1 Why use R?

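Like the S language, which originated at Bell Labs, R is a language and environment built for statistical computing and graphics. It is an open-source data analysis solution maintained by a large, active, worldwide community.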

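The main features of R are:

  • R is free.
  • R is a comprehensive platform for statistical work and can handle almost any kind of data analysis task.
  • R has a huge number of packages, including advanced statistical routines that are not yet available in other software.
  • Its plotting capabilities are very powerful.
  • It is cross-platform and easy to extend or integrate with other tools (C++, Python, Spark, SAS, and so on).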

1.2 Installing R and RStudio

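  • R can be downloaded for free directly from CRAN. Each operating system has its own binary build, so pick the version that matches your OS and install it.

  • RStudio is an integrated development environment (IDE) for R and is very convenient to use. The figure below shows the RStudio interface: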

1.3 Packages and installation

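A package is a collection of R functions, data, and precompiled code in a well-defined format. R ships with a set of default packages; other packages can be downloaded and installed.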

  • Install: install.packages("package_name")
  • Load: library(package_name)
  • Update: update.packages(oldPkgs = "package_name")
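
The install and load commands above are often combined so that a package is installed only when it is missing. The following is a minimal sketch; `load_or_install` is a hypothetical helper name, not a base R function, and the call uses the built-in `stats` package so that nothing is actually downloaded.

```r
# Hypothetical helper: load a package, installing it first only if it is missing.
load_or_install <- function(pkg) {
  # requireNamespace() returns FALSE (instead of raising an error) when the
  # package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this call only loads it
load_or_install("stats")
```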

Commonly used functions for managing the R workspace:
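
As a small illustration, the sketch below uses a few base R functions that are typically used for workspace management (`getwd()`, `ls()`, `rm()`). The object names `x` and `y` are made up for the example.

```r
getwd()          # print the current working directory

x <- 1:3         # create two example objects
y <- "hello"

objs <- ls()     # names of the objects in the current workspace
objs

rm(y)            # remove a single object
ls()             # "y" is no longer listed

# rm(list = ls()) would clear the whole workspace; left commented out on purpose
```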

II. R: Data Analysis

2.1 Common data structures

  • Vectors

    A vector is a one-dimensional array that stores numeric (e.g. 1, 2, 3, 4, 5), character (e.g. "a", "b", "c"), or logical (e.g. TRUE, FALSE) data.

   a<-c(1,2,3,4,5)
   a[2:4]   ## elements 2 through 4 of a
## [1] 2 3 4
   length(a) ## length of a
## [1] 5
   a[1]<-10  ## assign a new value to the first element
   a
## [1] 10  2  3  4  5
  • Matrices

    A matrix is a two-dimensional array in which every element has the same mode (numeric, character, or logical).

   mtrix<-matrix(rnorm(12),nrow=3,ncol=4)
   mtrix
##            [,1]       [,2]      [,3]       [,4]
## [1,] -1.2332728 -0.1311443 2.4518898  0.6408618
## [2,]  0.3848627  1.3397774 0.7949649 -0.3380689
## [3,] -1.3186845  1.0642058 0.8153079  0.5923581
   dim(mtrix)
## [1] 3 4
   dimnames(mtrix)<-list(c('row1','row2','row3'),c('col1','col2','col3','col4'))
   mtrix
##            col1       col2      col3       col4
## row1 -1.2332728 -0.1311443 2.4518898  0.6408618
## row2  0.3848627  1.3397774 0.7949649 -0.3380689
## row3 -1.3186845  1.0642058 0.8153079  0.5923581
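  • Data frames

    A data frame is similar to a matrix, except that different columns can hold data of different modes (numeric, character, and so on).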

   data<-data.frame(c1=1, c2=1:10, c3=sample(c('a','b','c'), 10, replace = TRUE))
   dim(data)   
## [1] 10  3
   str(data)
## 'data.frame':    10 obs. of  3 variables:
##  $ c1: num  1 1 1 1 1 1 1 1 1 1
##  $ c2: int  1 2 3 4 5 6 7 8 9 10
##  $ c3: Factor w/ 3 levels "a","b","c": 2 2 1 3 3 1 1 1 3 3
   names(data)
## [1] "c1" "c2" "c3"
   head(data)
##   c1 c2 c3
## 1  1  1  b
## 2  1  2  b
## 3  1  3  a
## 4  1  4  c
## 5  1  5  c
## 6  1  6  a
   nrow(data)
## [1] 10
   data[1,]
##   c1 c2 c3
## 1  1  1  b
   data[,1]
##  [1] 1 1 1 1 1 1 1 1 1 1
   data[,-1]
##    c2 c3
## 1   1  b
## 2   2  b
## 3   3  a
## 4   4  c
## 5   5  c
## 6   6  a
## 7   7  a
## 8   8  a
## 9   9  c
## 10 10  c
  • Arrays

    An array is similar to a matrix, but can have more than two dimensions.

    example(array)
## 
## array> dim(as.array(letters))
## [1] 26
## 
## array> array(1:3, c(2,4)) # recycle 1:3 "2 2/3 times"
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    2    1
## [2,]    2    1    3    2
## 
## array> #     [,1] [,2] [,3] [,4]
## array> #[1,]    1    3    2    1
## array> #[2,]    2    1    3    2
## array> 
## array> 
## array>
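
The output of example(array) above only shows one- and two-dimensional cases. As a complementary sketch, the snippet below builds a three-dimensional array; the name `arr` and the chosen dimensions are illustrative only.

```r
# A 2 x 3 x 4 array; values are filled in column-major order
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr)       # the three dimensions: 2, 3, 4
arr[2, 3, 4]   # a single element (the last one filled in)
arr[, , 1]     # the first 2 x 3 slice, which is an ordinary matrix
```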

2.2 Importing and exporting data

2.2.1 Exporting data

    library(hflights)
    write.csv(hflights,"hflights.csv",row.names = FALSE)

2.2.2 Importing data

R can import data from a wide range of sources:

  • Manually
  • From the command line
    hflights1<-read.csv("hflights.csv")
    str(hflights1)
## 'data.frame':    227496 obs. of  21 variables:
##  $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : Factor w/ 15 levels "AA","AS","B6",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ FlightNum        : int  428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : Factor w/ 3320 levels "","N0EGMQ","N10156",..: 1763 1703 1654 1090 1379 470 1382 1331 1328 1479 ...
##  $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : Factor w/ 2 levels "HOU","IAH": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Dest             : Factor w/ 116 levels "ABQ","AEX","AGS",..: 33 33 33 33 33 33 33 33 33 33 ...
##  $ Distance         : int  224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : Factor w/ 5 levels "","A","B","C",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
    head(hflights1)
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     1          1         6    1400    1500            AA       428
## 2 2011     1          2         7    1401    1501            AA       428
## 3 2011     1          3         1    1352    1502            AA       428
## 4 2011     1          4         2    1403    1513            AA       428
## 5 2011     1          5         3    1405    1507            AA       428
## 6 2011     1          6         4    1359    1503            AA       428
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N576AA                60      40      -10        0    IAH  DFW      224
## 2  N557AA                60      45       -9        1    IAH  DFW      224
## 3  N541AA                70      48       -8       -8    IAH  DFW      224
## 4  N403AA                70      39        3        3    IAH  DFW      224
## 5  N492AA                62      44       -3        5    IAH  DFW      224
## 6  N262AA                64      45       -7       -1    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1      7      13         0                         0
## 2      6       9         0                         0
## 3      5      17         0                         0
## 4      9      22         0                         0
## 5      9       9         0                         0
## 6      6      13         0                         0
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Warning: running command ''/usr/bin/otool' -L '/Library/Frameworks/
## R.framework/Resources/library/tcltk/libs//tcltk.so'' had status 1
## Loading required package: RSQLite
## Loading required package: DBI
hflights2<-read.csv.sql("hflights.csv",sql="select * from file where Dest='\"BNA\"'")
## Loading required package: tcltk
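## read.csv.sql does not strip the double quotes around CSV fields, so the quotes are escaped and matched literally in the query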
str(hflights2)
## 'data.frame':    3481 obs. of  21 variables:
##  $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int  1 1 2 2 2 2 3 3 3 3 ...
##  $ DayOfWeek        : int  6 6 7 7 7 7 1 1 1 1 ...
##  $ DepTime          : int  1419 1232 1813 900 716 1357 2000 1142 811 1341 ...
##  $ ArrTime          : int  1553 1402 1948 1032 845 1529 2132 1317 945 1519 ...
##  $ UniqueCarrier    : chr  "\"WN\"" "\"WN\"" "\"WN\"" "\"WN\"" ...
##  $ FlightNum        : int  1454 2360 41 1107 1216 3410 485 1425 1628 2198 ...
##  $ TailNum          : chr  "\"N364SW\"" "\"N665WN\"" "\"N397SW\"" "\"N433LV\"" ...
##  $ ActualElapsedTime: int  94 90 95 92 89 92 92 95 94 98 ...
##  $ AirTime          : int  78 77 81 76 78 79 79 83 81 84 ...
##  $ ArrDelay         : int  38 -18 -2 22 0 29 37 -3 20 4 ...
##  $ DepDelay         : int  49 -3 8 35 16 42 50 7 31 11 ...
##  $ Origin           : chr  "\"HOU\"" "\"HOU\"" "\"HOU\"" "\"HOU\"" ...
##  $ Dest             : chr  "\"BNA\"" "\"BNA\"" "\"BNA\"" "\"BNA\"" ...
##  $ Distance         : int  670 670 670 670 670 670 670 670 670 670 ...
##  $ TaxiIn           : int  5 5 7 5 4 5 4 5 6 7 ...
##  $ TaxiOut          : int  11 8 7 11 7 8 9 7 7 7 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr  "\"\"" "\"\"" "\"\"" "\"\"" ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
library(data.table)
## Warning: package 'data.table' was built under R version 3.3.2
## Warning: closing unused connection 6 (hflights.csv)
## Warning: closing unused connection 5 (hflights.csv)
hflights3<-fread("hflights.csv")
str(hflights3)
## Classes 'data.table' and 'data.frame':   227496 obs. of  21 variables:
##  $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
##  $ FlightNum        : int  428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
##  $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
##  $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ Distance         : int  224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr  "" "" "" "" ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>
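
The outputs above were recorded on R 3.3, where read.csv() converted strings to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so the same call now yields character columns. The sketch below shows the difference with a round trip through a temporary file; it uses the built-in `iris` data set instead of hflights so that no extra package is needed.

```r
# Write a built-in data set to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)

# Default behaviour (R >= 4.0.0): strings stay character
iris_chr <- read.csv(tmp)
class(iris_chr$Species)

# Ask explicitly for factors, as older R versions did by default
iris_fct <- read.csv(tmp, stringsAsFactors = TRUE)
class(iris_fct$Species)

unlink(tmp)  # clean up the temporary file
```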

2.3 Common data operations

2.3.1 Basic operations

Packages commonly used for data manipulation: sqldf, dplyr, data.table, reshape2…

  • Type conversion: as.integer(), as.factor(), as.character()…
  • Subsetting: subset(), filter() (dplyr)…
  • Merging: merge(), cbind(), rbind()…
  • Aggregation: table(), the apply family (lapply(), sapply(), tapply()), aggregate(), ddply() (plyr)…
  • String operations: nchar(), grep(), substr(), strsplit(), paste(), toupper(), tolower()…
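
The following sketch strings several of these base R functions together on a tiny, made-up data set. The data frames `scores` and `info` and all their values are invented for illustration.

```r
# Two small example tables (made-up data)
scores <- data.frame(id = 1:4,
                     group = c("a", "b", "a", "b"),
                     score = c("90", "75", "60", "85"))
info <- data.frame(id = 1:4, name = c("Ann", "Bob", "Cid", "Dee"))

# Type conversion: the scores were stored as text
scores$score <- as.integer(scores$score)

# Subsetting: keep rows with a score of at least 80
high <- subset(scores, score >= 80)

# Merging: join the two tables on the shared id column
merged <- merge(scores, info, by = "id")

# Aggregation: mean score per group, and a frequency table of groups
group_means <- aggregate(score ~ group, data = merged, FUN = mean)
group_counts <- table(merged$group)

# String operations: build upper-case labels and count their characters
labels <- toupper(paste(merged$name, merged$group, sep = "_"))
nchar(labels)

group_means
```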

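2.3.2 Statistical operations

  • Mean, median, variance, standard deviation: mean(), median(), var(), sd()
  • Quantiles: quantile()
  • Maximum, minimum, sum: max(), min(), sum()
  • Differences between consecutive elements: diff()
  • Standardization: scale()
  • Random sampling: sample()
  • Correlation coefficient: cor()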
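
A quick sketch of these functions on a small numeric vector; the values in `x` are arbitrary, and set.seed() is only there to make the random sample reproducible.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)                            # arithmetic mean
median(x)                          # middle value
var(x)                             # sample variance (divides by n - 1)
sd(x)                              # sample standard deviation, sqrt(var(x))
quantile(x, c(0.25, 0.5, 0.75))    # quartiles
c(max(x), min(x), sum(x))          # maximum, minimum, sum
diff(x)                            # differences between consecutive elements

z <- scale(x)                      # centred and scaled; returns a one-column matrix

set.seed(42)
s <- sample(x, size = 3)           # three elements drawn without replacement

y <- 2 * x + 1
cor(x, y)                          # y is a linear function of x
```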

2.3.3 User-defined functions

Syntax:

myfunction <- function(arg1, arg2, ...) {
  statements
  return(object)
}
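
As a concrete instance of this template, the sketch below defines a small function that computes the coefficient of variation (standard deviation divided by mean). The function name `cv` and its `na.rm` argument are chosen for illustration.

```r
# Coefficient of variation: sd(x) / mean(x)
cv <- function(x, na.rm = FALSE) {
  # Validate the input before computing anything
  if (!is.numeric(x)) {
    stop("x must be numeric")
  }
  result <- sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
  return(result)
}

cv(c(2, 4, 4, 4, 5, 5, 7, 9))
cv(c(1, NA, 3), na.rm = TRUE)   # missing values are dropped first
```

The explicit return() is optional in R, since a function returns the value of its last evaluated expression, but it mirrors the template above.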

2.3.4 Plotting

  • Basic graphics
## 
## 
##  demo(graphics)
##  ---- ~~~~~~~~
## 
## > #  Copyright (C) 1997-2009 The R Core Team
## > 
## > require(datasets)
## 
## > require(grDevices); require(graphics)
## 
## > ## Here is some code which illustrates some of the differences between
## > ## R and S graphics capabilities.  Note that colors are generally specified
## > ## by a character string name (taken from the X11 rgb.txt file) and that line
## > ## textures are given similarly.  The parameter "bg" sets the background
## > ## parameter for the plot and there is also an "fg" parameter which sets
## > ## the foreground color.
## > 
## > 
## > x <- stats::rnorm(50)
## 
## > opar <- par(bg = "white")
## 
## > plot(x, ann = FALSE, type = "n")

## 
## > abline(h = 0, col = gray(.90))
## 
## > lines(x, col = "green4", lty = "dotted")
## 
## > points(x, bg = "limegreen", pch = 21)
## 
## > title(main = "Simple Use of Color In a Plot",
## +       xlab = "Just a Whisper of a Label",
## +       col.main = "blue", col.lab = gray(.8),
## +       cex.main = 1.2, cex.lab = 1.0, font.main = 4, font.lab = 3)
## 
## > ## A little color wheel.    This code just plots equally spaced hues in
## > ## a pie chart.    If you have a cheap SVGA monitor (like me) you will
## > ## probably find that numerically equispaced does not mean visually
## > ## equispaced.  On my display at home, these colors tend to cluster at
## > ## the RGB primaries.  On the other hand on the SGI Indy at work the
## > ## effect is near perfect.
## > 
## > par(bg = "gray")
## 
## > pie(rep(1,24), col = rainbow(24), radius = 0.9)

## 
## > title(main = "A Sample Color Wheel", cex.main = 1.4, font.main = 3)
## 
## > title(xlab = "(Use this as a test of monitor linearity)",
## +       cex.lab = 0.8, font.lab = 3)
## 
## > ## We have already confessed to having these.  This is just showing off X11
## > ## color names (and the example (from the postscript manual) is pretty "cute".
## > 
## > pie.sales <- c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)
## 
## > names(pie.sales) <- c("Blueberry", "Cherry",
## +              "Apple", "Boston Cream", "Other", "Vanilla Cream")
## 
## > pie(pie.sales,
## +     col = c("purple","violetred1","green3","cornsilk","cyan","white"))

## 
## > title(main = "January Pie Sales", cex.main = 1.8, font.main = 1)
## 
## > title(xlab = "(Don't try this at home kids)", cex.lab = 0.8, font.lab = 3)
## 
## > ## Boxplots:  I couldn't resist the capability for filling the "box".
## > ## The use of color seems like a useful addition, it focuses attention
## > ## on the central bulk of the data.
## > 
## > par(bg="cornsilk")
## 
## > n <- 10
## 
## > g <- gl(n, 100, n*100)
## 
## > x <- rnorm(n*100) + sqrt(as.numeric(g))
## 
## > boxplot(split(x,g), col="lavender", notch=TRUE)

## 
## > title(main="Notched Boxplots", xlab="Group", font.main=4, font.lab=1)
## 
## > ## An example showing how to fill between curves.
## > 
## > par(bg="white")
## 
## > n <- 100
## 
## > x <- c(0,cumsum(rnorm(n)))
## 
## > y <- c(0,cumsum(rnorm(n)))
## 
## > xx <- c(0:n, n:0)
## 
## > yy <- c(x, rev(y))
## 
## > plot(xx, yy, type="n", xlab="Time", ylab="Distance")

## 
## > polygon(xx, yy, col="gray")
## 
## > title("Distance Between Brownian Motions")
## 
## > ## Colored plot margins, axis labels and titles.    You do need to be
## > ## careful with these kinds of effects.    It's easy to go completely
## > ## over the top and you can end up with your lunch all over the keyboard.
## > ## On the other hand, my market research clients love it.
## > 
## > x <- c(0.00, 0.40, 0.86, 0.85, 0.69, 0.48, 0.54, 1.09, 1.11, 1.73, 2.05, 2.02)
## 
## > par(bg="lightgray")
## 
## > plot(x, type="n", axes=FALSE, ann=FALSE)

## 
## > usr <- par("usr")
## 
## > rect(usr[1], usr[3], usr[2], usr[4], col="cornsilk", border="black")
## 
## > lines(x, col="blue")
## 
## > points(x, pch=21, bg="lightcyan", cex=1.25)
## 
## > axis(2, col.axis="blue", las=1)
## 
## > axis(1, at=1:12, lab=month.abb, col.axis="blue")
## 
## > box()
## 
## > title(main= "The Level of Interest in R", font.main=4, col.main="red")
## 
## > title(xlab= "1996", col.lab="red")
## 
## > ## A filled histogram, showing how to change the font used for the
## > ## main title without changing the other annotation.
## > 
## > par(bg="cornsilk")
## 
## > x <- rnorm(1000)
## 
## > hist(x, xlim=range(-4, 4, x), col="lavender", main="")

## 
## > title(main="1000 Normal Random Variates", font.main=3)
## 
## > ## A scatterplot matrix
## > ## The good old Iris data (yet again)
## > 
## > pairs(iris[1:4], main="Edgar Anderson's Iris Data", font.main=4, pch=19)

## 
## > pairs(iris[1:4], main="Edgar Anderson's Iris Data", pch=21,
## +       bg = c("red", "green3", "blue")[unclass(iris$Species)])

## 
## > ## Contour plotting
## > ## This produces a topographic map of one of Auckland's many volcanic "peaks".
## > 
## > x <- 10*1:nrow(volcano)
## 
## > y <- 10*1:ncol(volcano)
## 
## > lev <- pretty(range(volcano), 10)
## 
## > par(bg = "lightcyan")
## 
## > pin <- par("pin")
## 
## > xdelta <- diff(range(x))
## 
## > ydelta <- diff(range(y))
## 
## > xscale <- pin[1]/xdelta
## 
## > yscale <- pin[2]/ydelta
## 
## > scale <- min(xscale, yscale)
## 
## > xadd <- 0.5*(pin[1]/scale - xdelta)
## 
## > yadd <- 0.5*(pin[2]/scale - ydelta)
## 
## > plot(numeric(0), numeric(0),
## +      xlim = range(x)+c(-1,1)*xadd, ylim = range(y)+c(-1,1)*yadd,
## +      type = "n", ann = FALSE)

## 
## > usr <- par("usr")
## 
## > rect(usr[1], usr[3], usr[2], usr[4], col="green3")
## 
## > contour(x, y, volcano, levels = lev, col="yellow", lty="solid", add=TRUE)
## 
## > box()
## 
## > title("A Topographic Map of Maunga Whau", font= 4)
## 
## > title(xlab = "Meters North", ylab = "Meters West", font= 3)
## 
## > mtext("10 Meter Contour Spacing", side=3, line=0.35, outer=FALSE,
## +       at = mean(par("usr")[1:2]), cex=0.7, font=3)
## 
## > ## Conditioning plots
## > 
## > par(bg="cornsilk")
## 
## > coplot(lat ~ long | depth, data = quakes, pch = 21, bg = "green3")

## 
## > par(opar)
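  • Advanced graphics: the ggplot2 package

ggplot2 is an R package for drawing statistical graphics. Unlike most other graphics software, ggplot2 is built on an underlying grammar of graphics.

According to this grammar, a statistical graphic is a mapping from data to the aesthetic attributes (abbreviated aes: colour, shape, size, and so on) of geometric objects (abbreviated geom: points, lines, bars, and so on). A plot may also contain statistical transformations of the data (abbreviated stat), and it is drawn in a specific coordinate system (abbreviated coord). Faceting (facet, which splits the plotting window into several sub-windows) can be used to draw plots of different subsets of the data.

In short, a statistical graphic is composed of these independent graphical components.

For details on drawing plots, see: ggplot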
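
To make the grammar concrete, here is a minimal sketch that uses each of the components named above once. It assumes the ggplot2 package is installed and uses the built-in `mtcars` data set; the choice of variables is purely illustrative.

```r
library(ggplot2)  # assumes ggplot2 has been installed

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +      # data and aesthetic mapping (aes)
  geom_point(aes(colour = factor(cyl))) +        # geometric object (geom): points
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                      # statistical transformation (stat)
  coord_cartesian() +                            # coordinate system (coord)
  facet_wrap(~ am)                               # faceting: one panel per value of am

print(p)  # draws the plot on the active graphics device
```

Each component is added with `+`, so any layer can be swapped or removed independently of the others.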

III. R: Introduction to Modeling (to be continued)