R 软件安装

要愉快地使用R, 我们需要走两步:

界面熟悉和基本操作

R是一种区分大小写的解释型语言。你可以在命令提示符(>)后每次输入并执行一条命令,或者一次性执行写在脚本文件中的一组命令。R中有多种数据类型,包括向量、矩阵、数据框(与数据集类似)以及列表(各种对象的集合)。

可以当作最基本的计算器来使用。

基本运算

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiI3KzdcbjctN1xuNyo3XG43LzdcbjdeMlxuc3FydCg3KVxuXG5tZWFuKG10Y2FycyRtcGcpXG5cbnZhcihtdGNhcnMkbXBnKVxuXG5zZChtdGNhcnMkbXBnKVxuXG5tb2RlKG10Y2FycyRtcGcpXG5cbnF1YW50aWxlKG10Y2FycyRtcGcpIn0=

赋值

R使用<-作为赋值符号。

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJteV92YXI8LTQyXG5teV92YXIifQ==

对象

一个对象可以是任何能被赋值的东西。对于R来说,对象可以是任何东西(数据、函数、图形、分析结果,等等)

c() 这个函数

我们也可以使用c() 这个函数 function(c 意指是 combine)来赋值,它把多个对象放到一起,组成向量。

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsdWNreV9udW1iZXJzIDwtIGMoNywgNzcpXG5sdWNreV9udW1iZXJzIn0=

#注释

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIDIrMyJ9

帮助查询

功能包

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJpbnN0YWxsLnBhY2thZ2VzKFwiZHBseXJcIilcbmluc3RhbGwucGFja2FnZXMoXCJnZ3Bsb3QyXCIpXG5saWJyYXJ5KGRwbHlyKVxubGlicmFyeShnZ3Bsb3QyKSJ9

查看路径和设置路径

路径(工作路径)是我们读取数据和存贮结果的地方。

在自己的RStuido的 concole 输入: getwd() 查看自己的当前路径。 setwd(yourpath) 设置想要的路径。

或者使用RStudio右下方,Files这个tab里的齿轮来查看和更改。

数据类型

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIENoYW5nZSBteV9udW1lcmljIHRvIGJlIDQyXG5teV9udW1lcmljIDwtIDQyXG5cbiMgQ2hhbmdlIG15X2NoYXJhY3RlciB0byBiZSBcInVuaXZlcnNlXCJcbm15X2NoYXJhY3RlciA8LSBcInVuaXZlcnNlXCJcblxuIyBDaGFuZ2UgbXlfbG9naWNhbCB0byBiZSBGQUxTRVxubXlfbG9naWNhbCA8LSBGQUxTRSJ9
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIERlY2xhcmUgdmFyaWFibGVzIG9mIGRpZmZlcmVudCB0eXBlczpcbm15X251bWVyaWMgPC0gNDJcbm15X2NoYXJhY3RlciA8LSBcInVuaXZlcnNlXCJcbm15X2xvZ2ljYWwgPC0gRkFMU0VcblxuY2xhc3MobXlfbnVtZXJpYylcbmNsYXNzKG15X2NoYXJhY3RlcilcbmNsYXNzKG15X2xvZ2ljYWwpIn0=

因子 factor

是不是有点晕,不着急,我们看点例子:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJleGNlbGxlbmNlPC0gYyhcImV4Y2VsbGVudFwiLCBcImJhZFwiLCBcImdvb2RcIiwgXCJva2F5XCIsIFwiYmFkXCIpXG5leGNlbGxlbmNlXG5cbmV4Y2VsbGVuY2U8LSBmYWN0b3IoZXhjZWxsZW5jZSlcbmV4Y2VsbGVuY2VcblxuZXhjZWxsZW5jZSA8LSBmYWN0b3IoZXhjZWxsZW5jZSwgb3JkZXI9VFJVRSxcbiAgICAgICAgICAgICAgICAgICAgIGxldmVscz1jKFwiYmFkXCIsIFwib2theVwiLFwiZ29vZFwiLFwiZXhjZWxsZW50XCIpKVxuZXhjZWxsZW5jZSJ9

这里我们成功的把字符型变量excellence,先转换成了无序因子变量,再转换成了有顺序的因子变量。

  • 数值型变量可以用levels和labels参数来编码成因子。如果男性被编码成1,女性被编码成2,则以下语句:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJzZXg8LWMoMSwyLDIsMSwyLDEsMSwzKSBcbnNleFxuXG5zZXggPC0gZmFjdG9yKHNleCwgbGV2ZWxzPWMoMSwgMiksIGxhYmVscz1jKFwiTWFsZVwiLCBcIkZlbWFsZVwiKSlcbnNleCJ9
  • 在这个栗子里,性别被当成类别型变量,标签“Male”和“Female”替代了1和2在结果中输出,而且所有不是1或2的性别变量将被设为缺失值。

数据结构

有这么几个:

向量是用于存储数值型、字符型或逻辑型数据的一维数组。执行组合功能的函数c()可用来创建向量

注意:同一向量中无法混杂不同模式的数据。
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJhIDwtIGMoMSwgMiwgNSwgMywgNiwgLTIsIDQpXG5iIDwtIGMoXCJhcHBsZVwiLCBcInBlYXJcIiwgXCJvcmFuZ2VcIilcbmMgPC0gYyhUUlVFLCBGQUxTRSwgVFJVRSwgRkFMU0UsIFRSVUUsIEZBTFNFKSJ9

通过在方括号中给定元素所处位置的数值,我们可以访问向量中的元素。例如’a[c(2)]’用于访问向量a中的第二个元素。

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJhIDwtIGMoMSwgMiwgNSwgMywgNiwgLTIsIDQpXG5iIDwtIGMoXCJhcHBsZVwiLCBcInBlYXJcIiwgXCJvcmFuZ2VcIilcbmMgPC0gYyhUUlVFLCBGQUxTRSwgVFJVRSwgRkFMU0UsIFRSVUUsIEZBTFNFKVxuXG5hWzNdXG5cbmFbLTNdXG5cbmJbYygxLDMpXVxuXG5jWzI6NF0ifQ==

matrix 矩阵:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJteU1hdHJpeCA8LSBtYXRyaXgoMToxNSwgbnJvdz0zLCBuY29sPTUpICBcbm15TWF0cml4In0=
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJ5IDwtIG1hdHJpeCgxOjE4LCBucm93PTIpXG5cbnlcblxueVsyLF1cblxueVssMV1cblxueVsyLCAzXVxuXG55WzIsIGMoMyw1KV0ifQ==

dataframe 数据框

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJzdHVkZW50cyA8LSBjKFwiQVwiLCBcIkJcIiwgXCJDXCIsIFwiRFwiKVxuc3R1ZGVudHNcbm1hdGhfc2NvcmU8LWMoMTAwLCA4MCwgNzAsIDk1KVxubWF0aF9zY29yZVxuZW5nbGlzaF9zY29yZTwtYyg5NiwgODYsIDc3LCA5OSlcbmVuZ2xpc2hfc2NvcmUifQ==
在这种情况下,使用数据框是最佳选择。
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJzdHVkZW50cyA8LSBjKFwiQVwiLCBcIkJcIiwgXCJDXCIsIFwiRFwiKVxuc3R1ZGVudHNcbm1hdGhfc2NvcmU8LWMoMTAwLCA4MCwgNzAsIDk1KVxubWF0aF9zY29yZVxuZW5nbGlzaF9zY29yZTwtYyg5NiwgODYsIDc3LCA5OSlcbmVuZ2xpc2hfc2NvcmVcblxuc3R1ZGVudHNfc2NvcmVzPC1kYXRhLmZyYW1lKHN0dWRlbnRzLCBtYXRoX3Njb3JlLCBlbmdsaXNoX3Njb3JlKSJ9
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJzdHVkZW50cyA8LSBjKFwiQVwiLCBcIkJcIiwgXCJDXCIsIFwiRFwiKVxubWF0aF9zY29yZTwtYygxMDAsIDgwLCA3MCwgOTUpXG5lbmdsaXNoX3Njb3JlPC1jKDk2LCA4NiwgNzcsIDk5KVxuXG5zdHVkZW50c19zY29yZXM8LWRhdGEuZnJhbWUoc3R1ZGVudHMsIG1hdGhfc2NvcmUsIGVuZ2xpc2hfc2NvcmUpXG5cbnN0dWRlbnRzX3Njb3Jlc1ssMl1cblxuc3R1ZGVudHNfc2NvcmVzWyxcIm1hdGhfc2NvcmVcIl1cblxuc3R1ZGVudHNfc2NvcmVzJG1hdGhfc2NvcmUifQ==
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJzdHVkZW50cyA8LSBjKFwiQVwiLCBcIkJcIiwgXCJDXCIsIFwiRFwiKVxubWF0aF9zY29yZTwtYygxMDAsIDgwLCA3MCwgOTUpXG5lbmdsaXNoX3Njb3JlPC1jKDk2LCA4NiwgNzcsIDk5KVxuc3R1ZGVudHNfc2NvcmVzPC1kYXRhLmZyYW1lKHN0dWRlbnRzLCBtYXRoX3Njb3JlLCBlbmdsaXNoX3Njb3JlKVxuXG5kYXRhLmZyYW1lKHN0dWRlbnRzX3Njb3JlcyRzdHVkZW50cywgc3R1ZGVudHNfc2NvcmVzJG1hdGhfc2NvcmUpIn0=

list 列表

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJhIDwtIFwiTXkgRmlyc3QgTGlzdFwiXG5iIDwtIGMoMjUsIDI2LCAxOCwgMzkpXG5jIDwtIG1hdHJpeCgxOjEwLCBucm93PTUpXG5kIDwtIGMoXCJvbmVcIiwgXCJ0d29cIiwgXCJ0aHJlZVwiKVxuXG5teWxpc3QgPC0gbGlzdCh0aXRsZT1hICxiLGMsZClcbm15bGlzdCJ9

本例创建了一个列表,其中有四个成分:一个字符串、一个数值型向量、一个矩阵以及一个字符型向量。可以组合任意多的对象,并将它们保存为一个列表。

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJjYXJzIDwtIGxtKGZvcm11bGEgPSB3dH5tcGcsIGRhdGEgPSBtdGNhcnMpXG5zdW1tYXJ5KGNhcnMpIn0=

常用函数

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsZW5ndGgobXRjYXJzKVxuXG5sZW5ndGgobXRjYXJzJG1wZylcblxuZGltKG10Y2Fycylcblxuc3RyKG10Y2FycylcblxuY2xhc3MobXRjYXJzKVxuXG5uYW1lcyhtdGNhcnMpIn0=
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJjKDIsIDIwKVxuXG5jYmluZChzdHVkZW50cywgbWF0aF9zY29yZSlcblxucmJpbmQoc3R1ZGVudHMsIG1hdGhfc2NvcmUpXG5cbmhlYWQobXRjYXJzKVxuXG50YWlsKG10Y2FycylcblxubHMoKVxuXG5ybShhLCBiLCBjKVxuXG5scygpIn0=

课后练习一:

请计算半径为5的圆的面积,pi取值3.14:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIzLjE0KiIsInNvbHV0aW9uIjoiMy4xNCo1XjIifQ==
请用<- 创建变量 my_height,并输出:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIENyZWF0ZSBhIHZhcmlhYmxlIG15X2hlaWdodCwgZXF1YWwgdG8gd2hhdGV2ZXIgeW91ciBoZWlnaHQgaXMsXG5cbiMgUHJpbnQgb3V0IG15X2hlaWdodCIsInNvbHV0aW9uIjoiIyBDcmVhdGUgYSB2YXJpYWJsZSBteV9oZWlnaHQsIGVxdWFsIHRvIHdoYXRldmVyIHlvdXIgaGVpZ2h0IGlzLFxubXlfaGVpZ2h0PC0xNzVcbiAgXG4jIFByaW50IG91dCBteV9oZWlnaHRcbm15X2hlaWdodCJ9
请以两位导师的名字创建向量my_teachers(提示,character的向量里面的元素需要用双引号):
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJteV90ZWFjaGVyczwtIiwic29sdXRpb24iOiJteV90ZWFjaGVyczwtYyhcIkNocmlzXCIsIFwiSmFzb25cIilcbm15X3RlYWNoZXJzIn0=
请把Jason 从 my_teachers里面召唤出来:
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im15X3RlYWNoZXJzPC1jKFwiQ2hyaXNcIiwgXCJKYXNvblwiKSIsInNhbXBsZSI6IiMgSmFzb246Iiwic29sdXRpb24iOiIjIEphc29uOlxubXlfdGVhY2hlcnNbMl0ifQ==
请提取矩阵 my_data 第3行第5列的元素, my_data已在后台给出:
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im15X2RhdGE8LW1hdHJpeCgxOjEwMCwgbnJvdz0xMCkiLCJzYW1wbGUiOiJteV9kYXRhIiwic29sdXRpb24iOiJteV9kYXRhWzMsIDVdIn0=
请提取矩阵 my_data 第3行第5列到8列的元素, my_data已在后台给出:
eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6Im15X2RhdGE8LW1hdHJpeCgxOjEwMCwgbnJvdz0xMCkiLCJzYW1wbGUiOiJteV9kYXRhIiwic29sdXRpb24iOiJteV9kYXRhWzMsIGMoNTo4KV0ifQ==
在自己本地电脑上的RStudio Console里用 ? 查看plot()这个函数的帮助文档,拉到最下面,赋值黏贴第一个例子,并运行:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjcGxvdCIsInNvbHV0aW9uIjoicmVxdWlyZShzdGF0cykgIyBmb3IgbG93ZXNzLCBycG9pcywgcm5vcm1cbnBsb3QoY2FycylcbmxpbmVzKGxvd2VzcyhjYXJzKSkifQ==

最后一题:

在自己电脑上建立一个文件夹R_learning, 使用setwd(“mypath”), 建立指向该文件夹的工作路径。

完结撒花~

基本作图

一图胜千言

plot(x, y), 将x置于横轴,将y置于纵轴,绘制点集(x, y),散点图。使用help(plot)可以查看其他选项。

下面的代码打开一个图形窗口并生成了一幅散点图,横轴表 示车身重量,纵轴为每加仑汽油行驶的英里数。

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwbG90KG10Y2FycyR3dCwgbXRjYXJzJG1wZykifQ==

添加横坐标标签,添加纵坐标标签

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwbG90KG10Y2FycyR3dCwgbXRjYXJzJG1wZyxcbiAgICAgeGxhYj1cIk1pbGVzIFBlciBHYWxsb25cIixcbiAgICAgeWxhYj1cIkNhciBXZWlnaHRcIikifQ==

添加回归曲线

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwbG90KG10Y2FycyR3dCwgbXRjYXJzJG1wZyxcbiAgICAgeGxhYj1cIk1pbGVzIFBlciBHYWxsb25cIixcbiAgICAgeWxhYj1cIkNhciBXZWlnaHRcIilcbmFibGluZShsbShtdGNhcnMkbXBnfm10Y2FycyR3dCkpXG50aXRsZShcIlJlZ3Jlc3Npb24gb2YgTVBHIG9uIFdlaWdodFwiKSJ9

改变点的颜色和形状

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwbG90KG10Y2FycyR3dCwgbXRjYXJzJG1wZyxcbiAgICAgICAgIHhsYWI9XCJNaWxlcyBQZXIgR2FsbG9uXCIsXG4gICAgICAgICB5bGFiPVwiQ2FyIFdlaWdodFwiLFxuICAgICBjb2w9NCxcbiAgICAgcGNoPTE2KVxuYWJsaW5lKGxtKG10Y2FycyRtcGd+bXRjYXJzJHd0KSlcbnRpdGxlKFwiUmVncmVzc2lvbiBvZiBNUEcgb24gV2VpZ2h0XCIpIn0=

有这些形状可以选择:

总是用美元符号,是不是太麻烦?换一种方式:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJ3aXRoKG10Y2Fycyx7XG5wbG90KHd0LCBtcGcpXG5hYmxpbmUobG0obXBnfnd0KSlcbnRpdGxlKFwiUmVncmVzc2lvbiBvZiBNUEcgb24gV2VpZ2h0XCIpXG59XG4pIn0=

全局参数设定,多图同列, 例如设置2列2行, 四个直方图:

上面的图是这么实现的:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwYXIobWZyb3c9YygzLDEpKVxud2l0aChtdGNhcnMse1xuICBwYXIobWZyb3c9YygyLDIpKVxuICBoaXN0KHd0KVxuICBoaXN0KG1wZylcbiAgaGlzdChkaXNwKVxuICBoaXN0KGhwKVxufSkifQ==

箱线图:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJib3hwbG90KG10Y2FycyRtcGcpIn0=

保存图形:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwZGYoXCJteWdyYXBoLnBkZlwiKVxuICAgICAgYXR0YWNoKG10Y2FycylcbiAgICAgIHBsb3Qod3QsIG1wZylcbiAgICAgIGFibGluZShsbShtcGd+d3QpKVxuICAgICAgdGl0bGUoXCJSZWdyZXNzaW9uIG9mIE1QRyBvbiBXZWlnaHRcIilcbiAgICAgIGRldGFjaChtdGNhcnMpXG5kZXYub2ZmKCkifQ==

除了pdf(),还可以使用函数win.metafile()、png()、jpeg()、bmp()等将图形保存为其他格式。

课后练习二:

请使用str(), names()函数来观察airquality这个数据的变量:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJzdHIoKSIsInNvbHV0aW9uIjoic3RyKGFpcnF1YWxpdHkpXG5uYW1lcyhhaXJxdWFsaXR5KSJ9
summary来计算airquality数据里所有变量各自的平均值,最小值,最大值:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIGNob29zZSBhbnkgZnVuY3Rpb25zIHlvdSBsaWtlOiIsInNvbHV0aW9uIjoic3VtbWFyeShhaXJxdWFsaXR5KSJ9
请用plot()函数创建风速与风度的散点图:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwbG90KCkiLCJzb2x1dGlvbiI6IndpdGgoYWlycXVhbGl0eSx7XG4gICAgICAgcGxvdCh4PVdpbmQsIHk9VGVtcCxcbiAgICAgICAgICAgIHhsYWI9XCJ3aW5kIHNwZWVkXCIsXG4gICAgICAgICAgICB5bGFiID0gXCJUZW1wZXJhdHVyZVwiKVxuICAgIH0pIn0=
请在上图的基础上,添加回归曲线和标题“Weather in NYC”:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIGluIGFkZGl0aW9uLCB5b3UgbWF5IHRyeSB0byBpbXBsZW1lbnQgd2l0aCBmdW5jdGlvbjoiLCJzb2x1dGlvbiI6IndpdGgoYWlycXVhbGl0eSx7XG4gICAgICAgcGxvdCh4PVdpbmQsIHk9VGVtcCxcbiAgICAgICAgICAgIHhsYWI9XCJ3aW5kIHNwZWVkXCIsXG4gICAgICAgICAgICB5bGFiID0gXCJUZW1wZXJhdHVyZVwiKVxuICBhYmxpbmUobG0oV2luZH5UZW1wKSlcbiAgdGl0bGUoXCJXZWF0aGVyIGluIE5ZQ1wiKVxuICAgIH0pIn0=

最后一题,请使用pdf("mygraph.pdf")pdf("mygraph.pdf")将上面的图形保存到你的作业文件夹(本地硬盘)。

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJ3aXRoKGFpcnF1YWxpdHkse1xuICAgICAgIHBsb3QoeD1XaW5kLCB5PVRlbXAsXG4gICAgICAgICAgICB4bGFiPVwid2luZCBzcGVlZFwiLFxuICAgICAgICAgICAgeWxhYiA9IFwiVGVtcGVyYXR1cmVcIilcbiAgYWJsaW5lKGxtKFdpbmR+VGVtcCkpXG4gIHRpdGxlKFwiV2VhdGhlciBpbiBOWUNcIilcbiAgICB9KSJ9

读取数据

读取数据是数据分析的第一步,应该 so easy!

然而现实并非如此,因为真实的世界里,数据的格式百花齐放:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJhZDwtcmVhZC5jc3YoXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zL202amg1a3NwaWFubTIxNS9hZHZlcnRpc2luZy5jc3Y/ZGw9MVwiKSJ9

许多数据以flat files形式出现: simple tabular text files. 我们首先学习 how to read CSV and text files in R。 需要用到package utils里面的read.csv()函数,是R自带的。

我们来使用swimming_pools.csv做练习。it contains data on swimming pools in Brisbane, Australia (Source: data.gov.au).

可以使用我提供的dropbox链接下载数据到自己的本地硬盘,然后将其路径输入到read.csv(),或者直接使用dropbox链接读取。

read.csv() 默认将string 读成factor,可以通过设定选项stringsAsFactors = FALSE来修改。

看看结果有什么不同:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJwb29sczwtcmVhZC5jc3YoXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zL3VuYnlxZnNwYjk1cnNxbS9zd2ltbWluZ19wb29scy5jc3Y/ZGw9MVwiKVxuc3RyKHBvb2xzKVxuXG5wb29sczwtcmVhZC5jc3YoXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zL3VuYnlxZnNwYjk1cnNxbS9zd2ltbWluZ19wb29scy5jc3Y/ZGw9MVwiLCBzdHJpbmdzQXNGYWN0b3JzID0gRkFMU0UpXG5cbnN0cihwb29scykifQ==

我们再学习一下如何读取文本数据。 .txt files which are basically text files

使用的函数是read.delim(),默认是数据点分隔方式是 \t。(fields in a record are delimited by tabs) and the header argument to TRUE (the first row contains the field names)。

练习读取数据 hotdogs.txt, containing information on sodium and calorie levels in different hotdogs (Source: UCLA)。

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJob3Rkb2dzPC1yZWFkLmRlbGltKFwiaHR0cHM6Ly93d3cuZHJvcGJveC5jb20vcy8waGJyYXFjcGw0ZHc4dzgvaG90ZG9ncy50eHQ/ZGw9MVwiLCBzZXA9XCJcXHRcIixoZWFkZXIgPSBUUlVFKVxuIyBTdW1tYXJpemUgaG90ZG9nc1xuc3VtbWFyeShob3Rkb2dzKSJ9

使用read.table()来通吃各种非正常数据,需要具体设置各种参数。 具体的我们来看个例子,依旧用hotdogs数据:

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIFBhdGggdG8gdGhlIGhvdGRvZ3MudHh0IGZpbGU6IHBhdGhcbnBhdGggPC0gZmlsZS5wYXRoKFwiaHR0cHM6Ly93d3cuZHJvcGJveC5jb20vcy8waGJyYXFjcGw0ZHc4dzgvaG90ZG9ncy50eHQ/ZGw9MVwiKVxuXG4jIEltcG9ydCB0aGUgaG90ZG9ncy50eHQgZmlsZTogaG90ZG9nc1xuaG90ZG9ncyA8LSByZWFkLnRhYmxlKHBhdGgsIFxuICAgICAgICAgICAgICAgICAgICAgIHNlcCA9IFwiXFx0XCIsIFxuICAgICAgICAgICAgICAgICAgICAgIGNvbC5uYW1lcyA9IGMoXCJ0eXBlXCIsIFwiY2Fsb3JpZXNcIiwgXCJzb2RpdW1cIikpXG5cbiMgQ2FsbCBoZWFkKCkgb24gaG90ZG9nc1xuaGVhZChob3Rkb2dzKSJ9

有两个好朋友,李雷和韩梅梅想分享一个hotdog,但是在到底吃哪一个上面,他们意见不合。所以决定各自吃一个,李雷想要卡路里最少的,韩梅梅想吃含盐量最少的,我们来帮他们选一下吧!

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIEZpbmlzaCB0aGUgcmVhZC5kZWxpbSgpIGNhbGxcbiMgUGF0aCB0byB0aGUgaG90ZG9ncy50eHQgZmlsZTogcGF0aFxucGF0aCA8LSBmaWxlLnBhdGgoXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zLzBoYnJhcWNwbDRkdzh3OC9ob3Rkb2dzLnR4dD9kbD0xXCIpXG5cbiMgSW1wb3J0IHRoZSBob3Rkb2dzLnR4dCBmaWxlOiBob3Rkb2dzXG5ob3Rkb2dzIDwtIHJlYWQudGFibGUocGF0aCwgXG4gICAgICAgICAgICAgICAgICAgICAgc2VwID0gXCJcXHRcIiwgXG4gICAgICAgICAgICAgICAgICAgICAgY29sLm5hbWVzID0gYyhcInR5cGVcIiwgXCJjYWxvcmllc1wiLCBcInNvZGl1bVwiKSlcblxuIyBTZWxlY3QgdGhlIGhvdCBkb2cgd2l0aCB0aGUgbGVhc3QgY2Fsb3JpZXM6IGxpbHlcbkxpX0xlaSA8LSBob3Rkb2dzW3doaWNoLm1pbihob3Rkb2dzJGNhbG9yaWVzKSwgXVxuXG4jIFNlbGVjdCB0aGUgb2JzZXJ2YXRpb24gd2l0aCB0aGUgbW9zdCBzb2RpdW06IHRvbVxuSGFuX01laW1laSA8LSBob3Rkb2dzW3doaWNoLm1pbihob3Rkb2dzJHNvZGl1bSksIF1cblxuXG4jIFByaW50IExpX0xlaSwgSGFuX01laW1laVxuXG5MaV9MZWlcbkhhbl9NZWltZWkifQ==

单独设置每一个变量的数据类型。 如果我有好几个string类型的变量在我的数据里,我想要有的是string,有的是logical,还有的是factor,怎么破?

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJob3Rkb2dzIDwtIHJlYWQuZGVsaW0oXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zLzBoYnJhcWNwbDRkdzh3OC9ob3Rkb2dzLnR4dD9kbD0xXCIsIGhlYWRlciA9IEZBTFNFLCBjb2wubmFtZXMgPSBjKFwidHlwZVwiLCBcImNhbG9yaWVzXCIsIFwic29kaXVtXCIpKVxuXG4jIERpc3BsYXkgc3RydWN0dXJlIG9mIGhvdGRvZ3NcbnN0cihob3Rkb2dzKVxuXG4jIEVkaXQgdGhlIGNvbENsYXNzZXMgYXJndW1lbnQgdG8gaW1wb3J0IHRoZSBkYXRhIGNvcnJlY3RseTogaG90ZG9nczJcbmhvdGRvZ3MyIDwtIHJlYWQuZGVsaW0oXCJodHRwczovL3d3dy5kcm9wYm94LmNvbS9zLzBoYnJhcWNwbDRkdzh3OC9ob3Rkb2dzLnR4dD9kbD0xXCIsIGhlYWRlciA9IEZBTFNFLCBcbiAgICAgICAgICAgICAgICAgICAgICAgY29sLm5hbWVzID0gYyhcInR5cGVcIiwgXCJjYWxvcmllc1wiLCBcInNvZGl1bVwiKSxcbiAgICAgICAgICAgICAgICAgICAgICAgY29sQ2xhc3NlcyA9IGMoXCJmYWN0b3JcIiwgXCJOVUxMXCIsIFwibnVtZXJpY1wiKSlcblxuIyBEaXNwbGF5IHN0cnVjdHVyZSBvZiBob3Rkb2dzMlxuc3RyKGhvdGRvZ3MyKSJ9

编写函数

到目前为止你已经接触并使用了很多R的函数,现在我们要学习编写函数?为什么要编写函数,不是已经有很多现成的吗?再说,我又不编写r packages。我们学习编写函数有两个原因:

  1. make your code more readable, avoid coding errors
  2. automate repetitive tasks

函数的模版一般长这个样子: my_fun <- function(arg1, arg2) { # body }

  1. my_fun is the variable that you want to assign your function to
  2. arg1 and arg2 are arguments to the function. The template has two arguments, but you can specify any number of arguments, each separated by a comma.
  3. You then replace # body with the R code that your function will execute, referring to the inputs by the argument names you specified.

来,我们编写一个计算圆面积的函数 size():

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIERlZmluZSBzaXplKCkgZnVuY3Rpb25cbm15X2Z1biA8LSBmdW5jdGlvbihhcmcxKSB7XG4gICMgYm9keVxufVxuXG4jIENhbGwgc2l6ZSgpIHdpdGggYXJndW1lbnQgNSJ9
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJkZjwtZGF0YS5mcmFtZShcbiAgYSA9IHJub3JtKDEwKSxcbiAgYiA9IHJub3JtKDEwKSxcbiAgYyA9IHJub3JtKDEwKSxcbiAgZCA9IHJub3JtKDEwKVxuKVxuIyBDcmVhdGUgbmV3IGRvdWJsZSB2ZWN0b3I6IG91dHB1dFxuXG5cbiMgQWx0ZXIgdGhlIGxvb3BcbmZvciAoaSBpbiBzZXFfYWxvbmcoZGYpKSB7XG4gICMgQ2hhbmdlIGNvZGUgdG8gc3RvcmUgcmVzdWx0IGluIG91dHB1dFxuICBwcmludChtZWRpYW4oZGZbW2ldXSkpXG59XG5cbiMgUHJpbnQgb3V0cHV0Iiwic29sdXRpb24iOiJkZjwtZGF0YS5mcmFtZShcbiAgYSA9IHJub3JtKDEwKSxcbiAgYiA9IHJub3JtKDEwKSxcbiAgYyA9IHJub3JtKDEwKSxcbiAgZCA9IHJub3JtKDEwKVxuKVxuIyBDcmVhdGUgbmV3IGRvdWJsZSB2ZWN0b3I6IG91dHB1dFxub3V0cHV0IDwtIHZlY3RvcihcImRvdWJsZVwiLCBuY29sKGRmKSlcblxuIyBBbHRlciB0aGUgbG9vcFxuZm9yIChpIGluIHNlcV9hbG9uZyhkZikpIHtcbiAgIyBDaGFuZ2UgY29kZSB0byBzdG9yZSByZXN1bHQgaW4gb3V0cHV0XG4gIG91dHB1dFtbaV1dIDwtIG1lZGlhbihkZltbaV1dKVxufVxuXG4jIFByaW50IG91dHB1dFxub3V0cHV0In0=

注意: 1. match arguments by positions, by names? by name is always perferred, especially when you have more than 2 arguments, or the arguments have default values. 2. when you call a function, you should place a space around = in function calls, and always put a space after a comma, not before (just like in regular English). Using whitespace makes it easier to skim the function for the important components.

我们说过,函数可以帮助我们避免重复劳动同时减少出错,来,看个例子: 一个10*4 的数据集已经预备好了,我们要计算每一列的中位数,并把结果存起来。 编写完成这个任务的函数,我们要 1. 写一个空白的向量,以备存储计算结果 2. 用 seq_along() 来提取待计算的数据集的长度 3. 使用for loop 来迭代(从第一行计算到最后一行)

一步一步,我们来搭建另一个函数rescale01可以把一个数据集的每一列数据都变成0-1之间:

(df$a - min(df$a, na.rm = TRUE)) /(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))

首先,确定需要哪些输入inputs?很明显是df$a,简便起见,就叫它x吧。

定义一个示意的向量,我们可以把上面的一段代码写成这样:
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIERlZmluZSBleGFtcGxlIHZlY3RvciB4XG54IDwtIDE6MTBcblxuIyBSZXdyaXRlIHRoaXMgc25pcHBldCB0byByZWZlciB0byB4XG4oeCAtIG1pbih4LCBuYS5ybSA9IFRSVUUpKSAvXG4gIChtYXgoeCwgbmEucm0gPSBUUlVFKSAtIG1pbih4LCBuYS5ybSA9IFRSVUUpKSJ9

庖丁解牛,我们看看函数的核心里面都有什么,有重复出现的内容吗?我们可不可以简化一下?

  1. One obviously duplicated statement is min(x, na.rm = TRUE). It makes more sense for us just to calculate it once, store the result, and then refer to it when needed.
  2. In fact, since we also need the maximum value of x, it would be even better to calculate the range once, then refer to the first and second elements when they are needed.

具体我们这么操作:

  1. Define the intermediate variable rng to contain the range of x using the function range().
  2. Specify the na.rm() argument to automatically ignore any NAs in the vector.
  3. Rewrite the snippet to refer to this intermediate variable.
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIERlZmluZSBleGFtcGxlIHZlY3RvciB4XG54IDwtIDE6MTBcblxuIyBEZWZpbmUgcm5nXG5cblxuIyBSZXdyaXRlIHRoaXMgc25pcHBldCB0byByZWZlciB0byB0aGUgZWxlbWVudHMgb2Ygcm5nXG4oeCAtIG1pbih4LCBuYS5ybSA9IFRSVUUpKSAvXG4gIChtYXgoeCwgbmEucm0gPSBUUlVFKSAtIG1pbih4LCBuYS5ybSA9IFRSVUUpKVxuXG4jIFVzZSB0aGUgZnVuY3Rpb24gdGVtcGxhdGUgdG8gY3JlYXRlIHRoZSByZXNjYWxlMDEgZnVuY3Rpb25cblxuIyBUZXN0IHlvdXIgZnVuY3Rpb24sIGNhbGwgcmVzY2FsZSB1c2luZyB0aGUgdmVjdG9yIHggYXMgdGhlIGFyZ3VtZW50Iiwic29sdXRpb24iOiIjIERlZmluZSBleGFtcGxlIHZlY3RvciB4XG54IDwtIDE6MTAgXG5cbiMgRGVmaW5lIHJuZ1xucm5nIDwtIHJhbmdlKHgsIG5hLnJtID0gVFJVRSlcblxuIyBSZXdyaXRlIHRoaXMgc25pcHBldCB0byByZWZlciB0byB0aGUgZWxlbWVudHMgb2Ygcm5nXG4oeCAtIHJuZ1sxXSkgLyBcbiAgKHJuZ1syXSAtIHJuZ1sxXSlcblxuIyBVc2UgdGhlIGZ1bmN0aW9uIHRlbXBsYXRlIHRvIGNyZWF0ZSB0aGUgcmVzY2FsZTAxIGZ1bmN0aW9uXG5yZXNjYWxlMDEgPC0gZnVuY3Rpb24oeCkge1xuICBybmcgPC0gcmFuZ2UoeCwgbmEucm0gPSBUUlVFKVxuICAoeCAtIHJuZ1sxXSkgLyAocm5nWzJdIC0gcm5nWzFdKVxufVxuXG4jIFRlc3QgeW91ciBmdW5jdGlvbiwgY2FsbCByZXNjYWxlIHVzaW5nIHRoZSB2ZWN0b3IgeCBhcyB0aGUgYXJndW1lbnRcbnJlc2NhbGUwMSh4KSJ9