dataframe 数据框

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJzdHVkZW50cyA8LSBjKFwiQVwiLCBcIkJcIiwgXCJDXCIsIFwiRFwiKVxuc3R1ZGVudHNcbm1hdGhfc2NvcmU8LWMoMTAwLCA4MCwgNzAsIDk1KVxubWF0aF9zY29yZVxuZW5nbGlzaF9zY29yZTwtYyg5NiwgODYsIDc3LCA5OSlcbmVuZ2xpc2hfc2NvcmUifQ==
在这种情况下,使用数据框是最佳选择。
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJzdHVkZW50cyA8LSBjKFwiQVwiLCBcIkJcIiwgXCJDXCIsIFwiRFwiKVxuc3R1ZGVudHNcbm1hdGhfc2NvcmU8LWMoMTAwLCA4MCwgNzAsIDk1KVxubWF0aF9zY29yZVxuZW5nbGlzaF9zY29yZTwtYyg5NiwgODYsIDc3LCA5OSlcbmVuZ2xpc2hfc2NvcmVcblxuc3R1ZGVudHNfc2NvcmVzPC1kYXRhLmZyYW1lKHN0dWRlbnRzLCBtYXRoX3Njb3JlLCBlbmdsaXNoX3Njb3JlKSJ9
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJzdHVkZW50cyA8LSBjKFwiQVwiLCBcIkJcIiwgXCJDXCIsIFwiRFwiKVxubWF0aF9zY29yZTwtYygxMDAsIDgwLCA3MCwgOTUpXG5lbmdsaXNoX3Njb3JlPC1jKDk2LCA4NiwgNzcsIDk5KVxuXG5zdHVkZW50c19zY29yZXM8LWRhdGEuZnJhbWUoc3R1ZGVudHMsIG1hdGhfc2NvcmUsIGVuZ2xpc2hfc2NvcmUpXG5cbnN0dWRlbnRzX3Njb3Jlc1ssMl1cblxuc3R1ZGVudHNfc2NvcmVzWyxcIm1hdGhfc2NvcmVcIl1cblxuc3R1ZGVudHNfc2NvcmVzJG1hdGhfc2NvcmUifQ==
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJzdHVkZW50cyA8LSBjKFwiQVwiLCBcIkJcIiwgXCJDXCIsIFwiRFwiKVxubWF0aF9zY29yZTwtYygxMDAsIDgwLCA3MCwgOTUpXG5lbmdsaXNoX3Njb3JlPC1jKDk2LCA4NiwgNzcsIDk5KVxuc3R1ZGVudHNfc2NvcmVzPC1kYXRhLmZyYW1lKHN0dWRlbnRzLCBtYXRoX3Njb3JlLCBlbmdsaXNoX3Njb3JlKVxuXG5kYXRhLmZyYW1lKHN0dWRlbnRzX3Njb3JlcyRzdHVkZW50cywgc3R1ZGVudHNfc2NvcmVzJG1hdGhfc2NvcmUpIn0=

list 列表

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJhIDwtIFwiTXkgRmlyc3QgTGlzdFwiXG5iIDwtIGMoMjUsIDI2LCAxOCwgMzkpXG5jIDwtIG1hdHJpeCgxOjEwLCBucm93PTUpXG5kIDwtIGMoXCJvbmVcIiwgXCJ0d29cIiwgXCJ0aHJlZVwiKVxuXG5teWxpc3QgPC0gbGlzdCh0aXRsZT1hICxiLGMsZClcbm15bGlzdCJ9

本例创建了一个列表,其中有四个成分:一个字符串、一个数值型向量、一个矩阵以及一个字符型向量。可以组合任意多的对象,并将它们保存为一个列表。

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJjYXJzIDwtIGxtKGZvcm11bGEgPSB3dH5tcGcsIGRhdGEgPSBtdGNhcnMpXG5zdW1tYXJ5KGNhcnMpIn0=

常用函数

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsZW5ndGgobXRjYXJzKVxuXG5sZW5ndGgobXRjYXJzJG1wZylcblxuZGltKG10Y2Fycylcblxuc3RyKG10Y2FycylcblxuY2xhc3MobXRjYXJzKVxuXG5uYW1lcyhtdGNhcnMpIn0=
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJjKDIsIDIwKVxuXG5jYmluZChzdHVkZW50cywgbWF0aF9zY29yZSlcblxucmJpbmQoc3R1ZGVudHMsIG1hdGhfc2NvcmUpXG5cbmhlYWQobXRjYXJzKVxuXG50YWlsKG10Y2FycylcblxubHMoKVxuXG5ybShhLCBiLCBjKVxuXG5scygpIn0=

课后练习一:

请计算半径为5的圆的面积,pi取值3.14:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIzLjE0KiIsInNvbHV0aW9uIjoiMy4xNCo1XjIifQ==
请用<- 创建变量 my_height,并输出:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIENyZWF0ZSBhIHZhcmlhYmxlIG15X2hlaWdodCwgZXF1YWwgdG8gd2hhdGV2ZXIgeW91ciBoZWlnaHQgaXMsXG5cbiMgUHJpbnQgb3V0IG15X2hlaWdodCIsInNvbHV0aW9uIjoiIyBDcmVhdGUgYSB2YXJpYWJsZSBteV9oZWlnaHQsIGVxdWFsIHRvIHdoYXRldmVyIHlvdXIgaGVpZ2h0IGlzLFxubXlfaGVpZ2h0PC0xNzVcbiAgXG4jIFByaW50IG91dCBteV9oZWlnaHRcbm15X2hlaWdodCJ9
请以两位导师的名字创建向量my_teachers(提示,character的向量里面的元素需要用双引号):
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJteV90ZWFjaGVyczwtIiwic29sdXRpb24iOiJteV90ZWFjaGVyczwtYyhcIkNocmlzXCIsIFwiSmFzb25cIilcbm15X3RlYWNoZXJzIn0=
请把Jason 从 my_teachers里面召唤出来:
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im15X3RlYWNoZXJzPC1jKFwiQ2hyaXNcIiwgXCJKYXNvblwiKSIsInNhbXBsZSI6IiMgSmFzb246Iiwic29sdXRpb24iOiIjIEphc29uOlxubXlfdGVhY2hlcnNbMl0ifQ==
请提取矩阵 my_data 第3行第5列的元素, my_data已在后台给出:
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im15X2RhdGE8LW1hdHJpeCgxOjEwMCwgbnJvdz0xMCkiLCJzYW1wbGUiOiJteV9kYXRhIiwic29sdXRpb24iOiJteV9kYXRhWzMsIDVdIn0=
请提取矩阵 my_data 第3行第5列到8列的元素, my_data已在后台给出:
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im15X2RhdGE8LW1hdHJpeCgxOjEwMCwgbnJvdz0xMCkiLCJzYW1wbGUiOiJteV9kYXRhIiwic29sdXRpb24iOiJteV9kYXRhWzMsIGMoNTo4KV0ifQ==
在自己本地电脑上的RStudio Console里用 ? 查看plot()这个函数的帮助文档,拉到最下面,赋值黏贴第一个例子,并运行:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjcGxvdCIsInNvbHV0aW9uIjoicmVxdWlyZShzdGF0cykgIyBmb3IgbG93ZXNzLCBycG9pcywgcm5vcm1cbnBsb3QoY2FycylcbmxpbmVzKGxvd2VzcyhjYXJzKSkifQ==

最后一题:

在自己电脑上建立一个文件夹R_learning, 使用setwd(“mypath”), 建立指向该文件夹的工作路径。

完结撒花~

基本作图

一图胜千言

plot(x, y), 将x置于横轴,将y置于纵轴,绘制点集(x, y),散点图。使用help(plot)可以查看其他选项。

下面的代码打开一个图形窗口并生成了一幅散点图,横轴表 示车身重量,纵轴为每加仑汽油行驶的英里数。

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwbG90KG10Y2FycyR3dCwgbXRjYXJzJG1wZykifQ==

添加横坐标标签,添加纵坐标标签

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwbG90KG10Y2FycyR3dCwgbXRjYXJzJG1wZyxcbiAgICAgeGxhYj1cIk1pbGVzIFBlciBHYWxsb25cIixcbiAgICAgeWxhYj1cIkNhciBXZWlnaHRcIikifQ==

添加回归曲线

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwbG90KG10Y2FycyR3dCwgbXRjYXJzJG1wZyxcbiAgICAgeGxhYj1cIk1pbGVzIFBlciBHYWxsb25cIixcbiAgICAgeWxhYj1cIkNhciBXZWlnaHRcIilcbmFibGluZShsbShtdGNhcnMkbXBnfm10Y2FycyR3dCkpXG50aXRsZShcIlJlZ3Jlc3Npb24gb2YgTVBHIG9uIFdlaWdodFwiKSJ9

改变点的颜色和形状

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwbG90KG10Y2FycyR3dCwgbXRjYXJzJG1wZyxcbiAgICAgICAgIHhsYWI9XCJNaWxlcyBQZXIgR2FsbG9uXCIsXG4gICAgICAgICB5bGFiPVwiQ2FyIFdlaWdodFwiLFxuICAgICBjb2w9NCxcbiAgICAgcGNoPTE2KVxuYWJsaW5lKGxtKG10Y2FycyRtcGd+bXRjYXJzJHd0KSlcbnRpdGxlKFwiUmVncmVzc2lvbiBvZiBNUEcgb24gV2VpZ2h0XCIpIn0=

有这些形状可以选择:

总是用美元符号,是不是太麻烦?换一种方式:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJ3aXRoKG10Y2Fycyx7XG5wbG90KHd0LCBtcGcpXG5hYmxpbmUobG0obXBnfnd0KSlcbnRpdGxlKFwiUmVncmVzc2lvbiBvZiBNUEcgb24gV2VpZ2h0XCIpXG59XG4pIn0=

全局参数设定,多图同列, 例如设置2列2行, 四个直方图:

上面的图是这么实现的:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwYXIobWZyb3c9YygzLDEpKVxud2l0aChtdGNhcnMse1xuICBwYXIobWZyb3c9YygyLDIpKVxuICBoaXN0KHd0KVxuICBoaXN0KG1wZylcbiAgaGlzdChkaXNwKVxuICBoaXN0KGhwKVxufSkifQ==

箱线图:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJib3hwbG90KG10Y2FycyRtcGcpIn0=

保存图形:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwZGYoXCJteWdyYXBoLnBkZlwiKVxuICAgICAgYXR0YWNoKG10Y2FycylcbiAgICAgIHBsb3Qod3QsIG1wZylcbiAgICAgIGFibGluZShsbShtcGd+d3QpKVxuICAgICAgdGl0bGUoXCJSZWdyZXNzaW9uIG9mIE1QRyBvbiBXZWlnaHRcIilcbiAgICAgIGRldGFjaChtdGNhcnMpXG5kZXYub2ZmKCkifQ==

除了pdf(),还可以使用函数win.metafile()、png()、jpeg()、bmp()等将图形保存为其他格式。

课后练习二:

请使用str(), names()函数来观察airquality这个数据的变量:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJzdHIoKSIsInNvbHV0aW9uIjoic3RyKGFpcnF1YWxpdHkpXG5uYW1lcyhhaXJxdWFsaXR5KSJ9
summary来计算airquality数据里所有变量各自的平均值,最小值,最大值:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIGNob29zZSBhbnkgZnVuY3Rpb25zIHlvdSBsaWtlOiIsInNvbHV0aW9uIjoic3VtbWFyeShhaXJxdWFsaXR5KSJ9
请用plot()函数创建风速与风度的散点图:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwbG90KCkiLCJzb2x1dGlvbiI6IndpdGgoYWlycXVhbGl0eSx7XG4gICAgICAgcGxvdCh4PVdpbmQsIHk9VGVtcCxcbiAgICAgICAgICAgIHhsYWI9XCJ3aW5kIHNwZWVkXCIsXG4gICAgICAgICAgICB5bGFiID0gXCJUZW1wZXJhdHVyZVwiKVxuICAgIH0pIn0=
请在上图的基础上,添加回归曲线和标题“Weather in NYC”:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIGluIGFkZGl0aW9uLCB5b3UgbWF5IHRyeSB0byBpbXBsZW1lbnQgd2l0aCBmdW5jdGlvbjoiLCJzb2x1dGlvbiI6IndpdGgoYWlycXVhbGl0eSx7XG4gICAgICAgcGxvdCh4PVdpbmQsIHk9VGVtcCxcbiAgICAgICAgICAgIHhsYWI9XCJ3aW5kIHNwZWVkXCIsXG4gICAgICAgICAgICB5bGFiID0gXCJUZW1wZXJhdHVyZVwiKVxuICBhYmxpbmUobG0oV2luZH5UZW1wKSlcbiAgdGl0bGUoXCJXZWF0aGVyIGluIE5ZQ1wiKVxuICAgIH0pIn0=

最后一题,请使用pdf("mygraph.pdf")pdf("mygraph.pdf")将上面的图形保存到你的作业文件夹(本地硬盘)。

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJ3aXRoKGFpcnF1YWxpdHkse1xuICAgICAgIHBsb3QoeD1XaW5kLCB5PVRlbXAsXG4gICAgICAgICAgICB4bGFiPVwid2luZCBzcGVlZFwiLFxuICAgICAgICAgICAgeWxhYiA9IFwiVGVtcGVyYXR1cmVcIilcbiAgYWJsaW5lKGxtKFdpbmR+VGVtcCkpXG4gIHRpdGxlKFwiV2VhdGhlciBpbiBOWUNcIilcbiAgICB9KSJ9

读取数据

读取数据是数据分析的第一步,应该 so easy!

然而现实并非如此,因为真实的世界里,数据的格式百花齐放:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJhZDwtcmVhZC5jc3YoXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zL202amg1a3NwaWFubTIxNS9hZHZlcnRpc2luZy5jc3Y/ZGw9MVwiKSJ9

许多数据以flat files形式出现: simple tabular text files. 我们首先学习 how to read CSV and text files in R。 需要用到package utils里面的read.csv()函数,是R自带的。

我们来使用swimming_pools.csv做练习。it contains data on swimming pools in Brisbane, Australia (Source: data.gov.au).

可以使用我提供的dropbox链接下载数据到自己的本地硬盘,然后将其路径输入到read.csv(),或者直接使用dropbox链接读取。

read.csv() 默认将string 读成factor,可以通过设定选项stringsAsFactors = FALSE来修改。

看看结果有什么不同:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwb29sczwtcmVhZC5jc3YoXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zL3VuYnlxZnNwYjk1cnNxbS9zd2ltbWluZ19wb29scy5jc3Y/ZGw9MVwiKVxuc3RyKHBvb2xzKVxuXG5wb29sczwtcmVhZC5jc3YoXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zL3VuYnlxZnNwYjk1cnNxbS9zd2ltbWluZ19wb29scy5jc3Y/ZGw9MVwiLCBzdHJpbmdzQXNGYWN0b3JzID0gRkFMU0UpXG5cbnN0cihwb29scykifQ==

我们再学习一下如何读取文本数据。 .txt files which are basically text files

使用的函数是read.delim(),默认是数据点分隔方式是 \t。(fields in a record are delimited by tabs) and the header argument to TRUE (the first row contains the field names)。

练习读取数据 hotdogs.txt, containing information on sodium and calorie levels in different hotdogs (Source: UCLA)。

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJob3Rkb2dzPC1yZWFkLmRlbGltKFwiaHR0cHM6Ly93d3cuZHJvcGJveC5jb20vcy8waGJyYXFjcGw0ZHc4dzgvaG90ZG9ncy50eHQ/ZGw9MVwiLCBzZXA9XCJcXHRcIixoZWFkZXIgPSBUUlVFKVxuIyBTdW1tYXJpemUgaG90ZG9nc1xuc3VtbWFyeShob3Rkb2dzKSJ9

使用read.table()来通吃各种非正常数据,需要具体设置各种参数。 具体的我们来看个例子,依旧用hotdogs数据:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIFBhdGggdG8gdGhlIGhvdGRvZ3MudHh0IGZpbGU6IHBhdGhcbnBhdGggPC0gZmlsZS5wYXRoKFwiaHR0cHM6Ly93d3cuZHJvcGJveC5jb20vcy8waGJyYXFjcGw0ZHc4dzgvaG90ZG9ncy50eHQ/ZGw9MVwiKVxuXG4jIEltcG9ydCB0aGUgaG90ZG9ncy50eHQgZmlsZTogaG90ZG9nc1xuaG90ZG9ncyA8LSByZWFkLnRhYmxlKHBhdGgsIFxuICAgICAgICAgICAgICAgICAgICAgIHNlcCA9IFwiXFx0XCIsIFxuICAgICAgICAgICAgICAgICAgICAgIGNvbC5uYW1lcyA9IGMoXCJ0eXBlXCIsIFwiY2Fsb3JpZXNcIiwgXCJzb2RpdW1cIikpXG5cbiMgQ2FsbCBoZWFkKCkgb24gaG90ZG9nc1xuaGVhZChob3Rkb2dzKSJ9

有两个好朋友,李雷和韩梅梅想分享一个hotdog,但是在到底吃哪一个上面,他们意见不合。所以决定各自吃一个,李雷想要卡路里最少的,韩梅梅想吃含盐量最少的,我们来帮他们选一下吧!

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIEZpbmlzaCB0aGUgcmVhZC5kZWxpbSgpIGNhbGxcbiMgUGF0aCB0byB0aGUgaG90ZG9ncy50eHQgZmlsZTogcGF0aFxucGF0aCA8LSBmaWxlLnBhdGgoXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zLzBoYnJhcWNwbDRkdzh3OC9ob3Rkb2dzLnR4dD9kbD0xXCIpXG5cbiMgSW1wb3J0IHRoZSBob3Rkb2dzLnR4dCBmaWxlOiBob3Rkb2dzXG5ob3Rkb2dzIDwtIHJlYWQudGFibGUocGF0aCwgXG4gICAgICAgICAgICAgICAgICAgICAgc2VwID0gXCJcXHRcIiwgXG4gICAgICAgICAgICAgICAgICAgICAgY29sLm5hbWVzID0gYyhcInR5cGVcIiwgXCJjYWxvcmllc1wiLCBcInNvZGl1bVwiKSlcblxuIyBTZWxlY3QgdGhlIGhvdCBkb2cgd2l0aCB0aGUgbGVhc3QgY2Fsb3JpZXM6IGxpbHlcbkxpX0xlaSA8LSBob3Rkb2dzW3doaWNoLm1pbihob3Rkb2dzJGNhbG9yaWVzKSwgXVxuXG4jIFNlbGVjdCB0aGUgb2JzZXJ2YXRpb24gd2l0aCB0aGUgbW9zdCBzb2RpdW06IHRvbVxuSGFuX01laW1laSA8LSBob3Rkb2dzW3doaWNoLm1pbihob3Rkb2dzJHNvZGl1bSksIF1cblxuXG4jIFByaW50IExpX0xlaSwgSGFuX01laW1laVxuXG5MaV9MZWlcbkhhbl9NZWltZWkifQ==

单独设置每一个变量的数据类型。 如果我有好几个string类型的变量在我的数据里,我想要有的是string,有的是logical,还有的是factor,怎么破?

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJob3Rkb2dzIDwtIHJlYWQuZGVsaW0oXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zLzBoYnJhcWNwbDRkdzh3OC9ob3Rkb2dzLnR4dD9kbD0xXCIsIGhlYWRlciA9IEZBTFNFLCBjb2wubmFtZXMgPSBjKFwidHlwZVwiLCBcImNhbG9yaWVzXCIsIFwic29kaXVtXCIpKVxuXG4jIERpc3BsYXkgc3RydWN0dXJlIG9mIGhvdGRvZ3NcbnN0cihob3Rkb2dzKVxuXG4jIEVkaXQgdGhlIGNvbENsYXNzZXMgYXJndW1lbnQgdG8gaW1wb3J0IHRoZSBkYXRhIGNvcnJlY3RseTogaG90ZG9nczJcbmhvdGRvZ3MyIDwtIHJlYWQuZGVsaW0oXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zLzBoYnJhcWNwbDRkdzh3OC9ob3Rkb2dzLnR4dD9kbD0xXCIsIGhlYWRlciA9IEZBTFNFLCBcbiAgICAgICAgICAgICAgICAgICAgICAgY29sLm5hbWVzID0gYyhcInR5cGVcIiwgXCJjYWxvcmllc1wiLCBcInNvZGl1bVwiKSxcbiAgICAgICAgICAgICAgICAgICAgICAgY29sQ2xhc3NlcyA9IGMoXCJmYWN0b3JcIiwgXCJOVUxMXCIsIFwibnVtZXJpY1wiKSlcblxuIyBEaXNwbGF5IHN0cnVjdHVyZSBvZiBob3Rkb2dzMlxuc3RyKGhvdGRvZ3MyKSJ9

编写函数

到目前为止你已经接触并使用了很多R的函数,现在我们要学习编写函数?为什么要编写函数,不是已经有很多现成的吗?再说,我又不编写r packages。我们学习编写函数有两个原因:

  1. make your code more readable, avoid coding errors
  2. automate repetitive tasks

函数的模版一般长这个样子: my_fun <- function(arg1, arg2) { # body }

  1. my_fun is the variable that you want to assign your function to
  2. arg1 and arg2 are arguments to the function. The template has two arguments, but you can specify any number of arguments, each separated by a comma.
  3. You then replace # body with the R code that your function will execute, referring to the inputs by the argument names you specified.

来,我们编写一个计算圆面积的函数 size():

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIERlZmluZSBzaXplKCkgZnVuY3Rpb25cbm15X2Z1biA8LSBmdW5jdGlvbihhcmcxKSB7XG4gICMgYm9keVxufVxuXG4jIENhbGwgc2l6ZSgpIHdpdGggYXJndW1lbnQgNSJ9
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJkZjwtZGF0YS5mcmFtZShcbiAgYSA9IHJub3JtKDEwKSxcbiAgYiA9IHJub3JtKDEwKSxcbiAgYyA9IHJub3JtKDEwKSxcbiAgZCA9IHJub3JtKDEwKVxuKVxuIyBDcmVhdGUgbmV3IGRvdWJsZSB2ZWN0b3I6IG91dHB1dFxuXG5cbiMgQWx0ZXIgdGhlIGxvb3BcbmZvciAoaSBpbiBzZXFfYWxvbmcoZGYpKSB7XG4gICMgQ2hhbmdlIGNvZGUgdG8gc3RvcmUgcmVzdWx0IGluIG91dHB1dFxuICBwcmludChtZWRpYW4oZGZbW2ldXSkpXG59XG5cbiMgUHJpbnQgb3V0cHV0Iiwic29sdXRpb24iOiJkZjwtZGF0YS5mcmFtZShcbiAgYSA9IHJub3JtKDEwKSxcbiAgYiA9IHJub3JtKDEwKSxcbiAgYyA9IHJub3JtKDEwKSxcbiAgZCA9IHJub3JtKDEwKVxuKVxuIyBDcmVhdGUgbmV3IGRvdWJsZSB2ZWN0b3I6IG91dHB1dFxub3V0cHV0IDwtIHZlY3RvcihcImRvdWJsZVwiLCBuY29sKGRmKSlcblxuIyBBbHRlciB0aGUgbG9vcFxuZm9yIChpIGluIHNlcV9hbG9uZyhkZikpIHtcbiAgIyBDaGFuZ2UgY29kZSB0byBzdG9yZSByZXN1bHQgaW4gb3V0cHV0XG4gIG91dHB1dFtbaV1dIDwtIG1lZGlhbihkZltbaV1dKVxufVxuXG4jIFByaW50IG91dHB1dFxub3V0cHV0In0=

注意: 1. match arguments by positions, by names? by name is always perferred, especially when you have more than 2 arguments, or the arguments have default values. 2. when you call a function, you should place a space around = in function calls, and always put a space after a comma, not before (just like in regular English). Using whitespace makes it easier to skim the function for the important components.

我们说过,函数可以帮助我们避免重复劳动同时减少出错,来,看个例子: 一个10*4 的数据集已经预备好了,我们要计算每一列的中位数,并把结果存起来。 编写完成这个任务的函数,我们要 1. 写一个空白的向量,以备存储计算结果 2. 用 seq_along() 来提取待计算的数据集的长度 3. 使用for loop 来迭代(从第一行计算到最后一行)

一步一步,我们来搭建另一个函数rescale01可以把一个数据集的每一列数据都变成0-1之间:

(df$a - min(df$a, na.rm = TRUE)) /(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))

首先,确定需要哪些输入inputs?很明显是df$a,简便起见,就叫它x吧。

定义一个示意的向量,我们可以把上面的一段代码写成这样:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIERlZmluZSBleGFtcGxlIHZlY3RvciB4XG54IDwtIDE6MTBcblxuIyBSZXdyaXRlIHRoaXMgc25pcHBldCB0byByZWZlciB0byB4XG4oeCAtIG1pbih4LCBuYS5ybSA9IFRSVUUpKSAvXG4gIChtYXgoeCwgbmEucm0gPSBUUlVFKSAtIG1pbih4LCBuYS5ybSA9IFRSVUUpKSJ9

庖丁解牛,我们看看函数的核心里面都有什么,有重复出现的内容吗?我们可不可以简化一下?

  1. One obviously duplicated statement is min(x, na.rm = TRUE). It makes more sense for us just to calculate it once, store the result, and then refer to it when needed.
  2. In fact, since we also need the maximum value of x, it would be even better to calculate the range once, then refer to the first and second elements when they are needed.

具体我们这么操作:

  1. Define the intermediate variable rng to contain the range of x using the function range().
  2. Specify the na.rm() argument to automatically ignore any NAs in the vector.
  3. Rewrite the snippet to refer to this intermediate variable.
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIERlZmluZSBleGFtcGxlIHZlY3RvciB4XG54IDwtIDE6MTBcblxuIyBEZWZpbmUgcm5nXG5cblxuIyBSZXdyaXRlIHRoaXMgc25pcHBldCB0byByZWZlciB0byB0aGUgZWxlbWVudHMgb2Ygcm5nXG4oeCAtIG1pbih4LCBuYS5ybSA9IFRSVUUpKSAvXG4gIChtYXgoeCwgbmEucm0gPSBUUlVFKSAtIG1pbih4LCBuYS5ybSA9IFRSVUUpKVxuXG4jIFVzZSB0aGUgZnVuY3Rpb24gdGVtcGxhdGUgdG8gY3JlYXRlIHRoZSByZXNjYWxlMDEgZnVuY3Rpb25cblxuIyBUZXN0IHlvdXIgZnVuY3Rpb24sIGNhbGwgcmVzY2FsZSB1c2luZyB0aGUgdmVjdG9yIHggYXMgdGhlIGFyZ3VtZW50Iiwic29sdXRpb24iOiIjIERlZmluZSBleGFtcGxlIHZlY3RvciB4XG54IDwtIDE6MTAgXG5cbiMgRGVmaW5lIHJuZ1xucm5nIDwtIHJhbmdlKHgsIG5hLnJtID0gVFJVRSlcblxuIyBSZXdyaXRlIHRoaXMgc25pcHBldCB0byByZWZlciB0byB0aGUgZWxlbWVudHMgb2Ygcm5nXG4oeCAtIHJuZ1sxXSkgLyBcbiAgKHJuZ1syXSAtIHJuZ1sxXSlcblxuIyBVc2UgdGhlIGZ1bmN0aW9uIHRlbXBsYXRlIHRvIGNyZWF0ZSB0aGUgcmVzY2FsZTAxIGZ1bmN0aW9uXG5yZXNjYWxlMDEgPC0gZnVuY3Rpb24oeCkge1xuICBybmcgPC0gcmFuZ2UoeCwgbmEucm0gPSBUUlVFKVxuICAoeCAtIHJuZ1sxXSkgLyAocm5nWzJdIC0gcm5nWzFdKVxufVxuXG4jIFRlc3QgeW91ciBmdW5jdGlvbiwgY2FsbCByZXNjYWxlIHVzaW5nIHRoZSB2ZWN0b3IgeCBhcyB0aGUgYXJndW1lbnRcbnJlc2NhbGUwMSh4KSJ9

apply family

for loop 很有用,帮助我们完成重复性劳动并减少出错。不过呢,有时候要写的代码还是太多比较麻烦。所以apply family 闪亮登场,使我们的工作更容易。

lapply() 1. apply function over list or vector 2. output is a list

sapply() 1. apply function over list or vector 2. try to simplify list to array Note: Arrays are the R data objects which can store data in more than two dimensions.

vapply() 1. apply function over list or vector 2. explicitly specify output format 3. vapply() is safer than sapply()

具体怎么操作? 我们举个例子,简单直接: 1. 先写一个select_el()函数,让它提取向量里指定位置的元素 2. 我们使用这个函数, 结合lapply来分别提取数据里几个数学家的名字和生日

注意lapplyl是 list的意思,返回的值都是list。这个函数的作用是把其它的函数作用到数据集里面的每一个元素。

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIERlZmluaXRpb24gb2Ygc3BsaXRfbG93XG5waW9uZWVycyA8LSBjKFwiR0FVU1M6MTc3N1wiLCBcIkJBWUVTOjE3MDJcIiwgXCJQQVNDQUw6MTYyM1wiLCBcIlBFQVJTT046MTg1N1wiKVxuc3BsaXQgPC0gc3Ryc3BsaXQocGlvbmVlcnMsIHNwbGl0ID0gXCI6XCIpXG5zcGxpdF9sb3cgPC0gbGFwcGx5KHNwbGl0LCB0b2xvd2VyKVxuXG4jIEdlbmVyaWMgc2VsZWN0IGZ1bmN0aW9uXG5zZWxlY3RfZWwgPC0gZnVuY3Rpb24oeCwgaW5kZXgpIHtcbiAgeFtpbmRleF1cbn1cblxuIyBVc2UgbGFwcGx5KCkgdHdpY2Ugb24gc3BsaXRfbG93OiBuYW1lcyBhbmQgeWVhcnMiLCJzb2x1dGlvbiI6IiMgRGVmaW5pdGlvbiBvZiBzcGxpdF9sb3dcbnBpb25lZXJzIDwtIGMoXCJHQVVTUzoxNzc3XCIsIFwiQkFZRVM6MTcwMlwiLCBcIlBBU0NBTDoxNjIzXCIsIFwiUEVBUlNPTjoxODU3XCIpXG5zcGxpdCA8LSBzdHJzcGxpdChwaW9uZWVycywgc3BsaXQgPSBcIjpcIilcbnNwbGl0X2xvdyA8LSBsYXBwbHkoc3BsaXQsIHRvbG93ZXIpXG5cbiMgR2VuZXJpYyBzZWxlY3QgZnVuY3Rpb25cbnNlbGVjdF9lbCA8LSBmdW5jdGlvbih4LCBpbmRleCkge1xuICB4W2luZGV4XVxufVxuXG4jIFVzZSBsYXBwbHkoKSB0d2ljZSBvbiBzcGxpdF9sb3c6IG5hbWVzIGFuZCB5ZWFyc1xubmFtZXMgPC0gbGFwcGx5KHNwbGl0X2xvdywgc2VsZWN0X2VsLCBpbmRleCA9IDEpXG55ZWFycyA8LSBsYXBwbHkoc3BsaXRfbG93LCBzZWxlY3RfZWwsIGluZGV4ID0gMilcbm5hbWVzXG55ZWFycyJ9

我们现在使用另一个函数sapply()来做同样的事情, 看看结果有什么不一样?

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIERlZmluaXRpb24gb2Ygc3BsaXRfbG93XG5waW9uZWVycyA8LSBjKFwiR0FVU1M6MTc3N1wiLCBcIkJBWUVTOjE3MDJcIiwgXCJQQVNDQUw6MTYyM1wiLCBcIlBFQVJTT046MTg1N1wiKVxuc3BsaXQgPC0gc3Ryc3BsaXQocGlvbmVlcnMsIHNwbGl0ID0gXCI6XCIpXG5zcGxpdF9sb3cgPC0gbGFwcGx5KHNwbGl0LCB0b2xvd2VyKVxuXG4jIEdlbmVyaWMgc2VsZWN0IGZ1bmN0aW9uXG5zZWxlY3RfZWwgPC0gZnVuY3Rpb24oeCwgaW5kZXgpIHtcbiAgeFtpbmRleF1cbn1cblxuIyBVc2UgbGFwcGx5KCkgdHdpY2Ugb24gc3BsaXRfbG93OiBuYW1lcyBhbmQgeWVhcnMiLCJzb2x1dGlvbiI6IiMgRGVmaW5pdGlvbiBvZiBzcGxpdF9sb3dcbnBpb25lZXJzIDwtIGMoXCJHQVVTUzoxNzc3XCIsIFwiQkFZRVM6MTcwMlwiLCBcIlBBU0NBTDoxNjIzXCIsIFwiUEVBUlNPTjoxODU3XCIpXG5zcGxpdCA8LSBzdHJzcGxpdChwaW9uZWVycywgc3BsaXQgPSBcIjpcIilcbnNwbGl0X2xvdyA8LSBsYXBwbHkoc3BsaXQsIHRvbG93ZXIpXG5cbiMgR2VuZXJpYyBzZWxlY3QgZnVuY3Rpb25cbnNlbGVjdF9lbCA8LSBmdW5jdGlvbih4LCBpbmRleCkge1xuICB4W2luZGV4XVxufVxuXG4jIFVzZSBsYXBwbHkoKSB0d2ljZSBvbiBzcGxpdF9sb3c6IG5hbWVzIGFuZCB5ZWFyc1xubmFtZXMgPC0gc2FwcGx5KHNwbGl0X2xvdywgc2VsZWN0X2VsLCBpbmRleCA9IDEpXG55ZWFycyA8LSBzYXBwbHkoc3BsaXRfbG93LCBzZWxlY3RfZWwsIGluZGV4ID0gMilcbm5hbWVzXG55ZWFycyJ9

你可能已经猜到了,ssapply()里面是simplify的意思!

再看看下面的例子来体会一下sapply的作用,一个温度数据的list已经在后台加载好了

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6IiB0ZW1wPC0gbGlzdChjKDMsICA3LCAgOSwgIDYsIC0xKSwgYyg2LCAgOSwgMTIsIDEzLCAgNSksIGMoNCwgIDgsICAzLCAtMSwgLTMpLFxuICAgICAgICAgICAgICAgIGMoMSAsIDQsICA3LCAgMiwgLTIpLCBjKDUsIDcsIDksIDQsIDIpLCBjKC0zLCAgNSwgIDgsICA5LCAgNCksIGMoMywgNiwgOSwgNCwgMSkpIiwic2FtcGxlIjoiIyB0ZW1wIGlzIGFscmVhZHkgYXZhaWxhYmxlIGluIHRoZSB3b3Jrc3BhY2VcblxuIyBDcmVhdGUgYSBmdW5jdGlvbiB0aGF0IHJldHVybnMgbWluIGFuZCBtYXggb2YgYSB2ZWN0b3I6IGV4dHJlbWVzXG5leHRyZW1lcyA8LSBmdW5jdGlvbih4KSB7XG4gIGMobWluID0gbWluKHgpLCBfX18gPSBfX18pXG59XG5cbiMgQXBwbHkgZXh0cmVtZXMoKSBvdmVyIHRlbXAgd2l0aCBzYXBwbHkoKVxuXG5cbiMgQXBwbHkgZXh0cmVtZXMoKSBvdmVyIHRlbXAgd2l0aCBsYXBwbHkoKSIsInNvbHV0aW9uIjoiIyB0ZW1wIGlzIGFscmVhZHkgYXZhaWxhYmxlIGluIHRoZSB3b3Jrc3BhY2VcblxuIyBDcmVhdGUgYSBmdW5jdGlvbiB0aGF0IHJldHVybnMgbWluIGFuZCBtYXggb2YgYSB2ZWN0b3I6IGV4dHJlbWVzXG5leHRyZW1lcyA8LSBmdW5jdGlvbih4KSB7XG4gIGMobWluID0gbWluKHgpLCBtYXggPSBtYXgoeCkpXG59XG5cbiMgQXBwbHkgZXh0cmVtZXMoKSBvdmVyIHRlbXAgd2l0aCBzYXBwbHkoKVxuc2FwcGx5KHRlbXAsIGV4dHJlbWVzKVxuXG4jIEFwcGx5IGV4dHJlbWVzKCkgb3ZlciB0ZW1wIHdpdGggbGFwcGx5KClcbmxhcHBseSh0ZW1wLCBleHRyZW1lcykifQ==
最后来看看vapply():
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6IiB0ZW1wPC0gbGlzdChjKDMsICA3LCAgOSwgIDYsIC0xKSwgYyg2LCAgOSwgMTIsIDEzLCAgNSksIGMoNCwgIDgsICAzLCAtMSwgLTMpLFxuICAgICAgICAgICAgICAgIGMoMSAsIDQsICA3LCAgMiwgLTIpLCBjKDUsIDcsIDksIDQsIDIpLCBjKC0zLCAgNSwgIDgsICA5LCAgNCksIGMoMywgNiwgOSwgNCwgMSkpIiwic2FtcGxlIjoiIyB0ZW1wIGlzIGFscmVhZHkgYXZhaWxhYmxlIGluIHRoZSB3b3Jrc3BhY2VcblxuIyBEZWZpbml0aW9uIG9mIGJhc2ljcygpXG5iYXNpY3MgPC0gZnVuY3Rpb24oeCkge1xuICBjKG1pbiA9IG1pbih4KSwgbWVhbiA9IG1lYW4oeCksIG1heCA9IG1heCh4KSlcbn1cblxuIyBBcHBseSBiYXNpY3MoKSBvdmVyIHRlbXAgdXNpbmcgdmFwcGx5KCkiLCJzb2x1dGlvbiI6IiMgdGVtcCBpcyBhbHJlYWR5IGF2YWlsYWJsZSBpbiB0aGUgd29ya3NwYWNlXG5cbiMgRGVmaW5pdGlvbiBvZiBiYXNpY3MoKVxuYmFzaWNzIDwtIGZ1bmN0aW9uKHgpIHtcbiAgYyhtaW4gPSBtaW4oeCksIG1lYW4gPSBtZWFuKHgpLCBtYXggPSBtYXgoeCkpXG59XG5cbiMgQXBwbHkgYmFzaWNzKCkgb3ZlciB0ZW1wIHVzaW5nIHZhcHBseSgpXG52YXBwbHkodGVtcCwgYmFzaWNzLCBudW1lcmljKDMpKSJ9

你真的理解了sapply()vapply()的差异了吗? 那你修复一下下面的代码?

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6IiB0ZW1wPC0gbGlzdChjKDMsICA3LCAgOSwgIDYsIC0xKSwgYyg2LCAgOSwgMTIsIDEzLCAgNSksIGMoNCwgIDgsICAzLCAtMSwgLTMpLFxuICAgICAgICAgICAgICAgIGMoMSAsIDQsICA3LCAgMiwgLTIpLCBjKDUsIDcsIDksIDQsIDIpLCBjKC0zLCAgNSwgIDgsICA5LCAgNCksIGMoMywgNiwgOSwgNCwgMSkpIiwic2FtcGxlIjoiIyB0ZW1wIGlzIGFscmVhZHkgYXZhaWxhYmxlIGluIHRoZSB3b3Jrc3BhY2VcblxuIyBEZWZpbml0aW9uIG9mIHRoZSBiYXNpY3MoKSBmdW5jdGlvblxuYmFzaWNzIDwtIGZ1bmN0aW9uKHgpIHtcbiAgYyhtaW4gPSBtaW4oeCksIG1lYW4gPSBtZWFuKHgpLCBtZWRpYW4gPSBtZWRpYW4oeCksIG1heCA9IG1heCh4KSlcbn1cblxuIyBGaXggdGhlIGVycm9yOlxudmFwcGx5KHRlbXAsIGJhc2ljcywgbnVtZXJpYygzKSlcbnNhcHBseSh0ZW1wLCBiYXNpY3MsIG51bWVyaWMoMykpIiwic29sdXRpb24iOiIjIHRlbXAgaXMgYWxyZWFkeSBhdmFpbGFibGUgaW4gdGhlIHdvcmtzcGFjZVxuXG4jIERlZmluaXRpb24gb2YgdGhlIGJhc2ljcygpIGZ1bmN0aW9uXG5iYXNpY3MgPC0gZnVuY3Rpb24oeCkge1xuICBjKG1pbiA9IG1pbih4KSwgbWVhbiA9IG1lYW4oeCksIG1lZGlhbiA9IG1lZGlhbih4KSwgbWF4ID0gbWF4KHgpKVxufVxuXG4jIEZpeCB0aGUgZXJyb3I6XG52YXBwbHkodGVtcCwgYmFzaWNzLCBudW1lcmljKDQpKSJ9

完结撒花!