
Tutorial: Introduction to Data Wrangling with Python & R

Author: Haoxiang Qi（齐浩翔）
Date: Feb 8, 2020

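Preface

I have been learning R for only a little over a week, so this tutorial certainly contains omissions and mistakes, and my own ability is limited. Please bear with me.

Before you start, I suggest getting clear answers to the following questions, via Google or from more experienced people:

As a business-school student without a CS background, why should I learn to program?
Is the value these languages create greater than the cost of my time?
Am I patient, and able to sit down and focus?
What are the respective strengths of each language, and in which fields?

Until now I have used Python for data analysis and visualization, and its plotting libraries, both Matplotlib and Seaborn, gave me some headaches. Because three of my courses this semester are built on R, I started using R for the first time last week. It felt awkward at first, but ggplot2 is good enough to outweigh those reservations. Since many classmates are struggling with the pace of the lectures, I took the opportunity to summarize the basic data-processing operations in both Python and R.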

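Common packages

Common starter R packages: reticulate, fpp2 (pulls in ggplot2), Rmisc, dplyr, tidyr
Install them with install.packages()

Common starter Python packages: pandas (pulls in numpy), re (part of the standard library, nothing to install), dfply (provides the pipe), seaborn (pulls in matplotlib)
Open Anaconda-Navigator > Environments > choose "All" in the drop-down menu > search for and tick each package in turn > click Apply to install (if that fails, install with pip instead)
[Screenshot 1]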

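Environment setup

I chose RStudio as the IDE because, so far, it is the IDE best suited to working with Python and R side by side. Jupyter Notebook is far more convenient for editing than RStudio's R Markdown, but RStudio's simple package management, its interface that fits ggplot plotting so well, and the seamless passing of variables between Python and R won me over. If you only want to learn R or Python on its own, you can skip these steps.

Installing the Python and R environment:
1. Install Anaconda first, then download and install R and RStudio separately, in that order
2. Open RStudio and create a new R Markdown file (install any packages it prompts for)
3. Installing all of the packages listed in this tutorial is recommended
4. Create a new Markdown file [Screenshot 1]
5. Delete all of its content, create a new R chunk, and run the library(reticulate) and use_condaenv(...) lines shown in the next section (this has to be run again every time RStudio is restarted)
6. Once it runs without errors, comment out the use_condaenv("Anaconda") line by putting a # in front of it
7. You can then use R and Python together in separate chunks

Hints:
From Python, an R variable is accessed as r.*
From R, a Python variable is accessed as py$*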

Importing packages

R

library(reticulate)
# use_condaenv("Python_R") # only needs to run once; comment it out when knitting to HTML

# Load the R packages
library(ggplot2)
library(tidyr)
library(dplyr)
library(Rmisc)

Python

# Load the Python packages
import pandas as pd
import numpy as np
from dfply import *
import re

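Basic syntax

Lists

Python

# For basic syntax like this, I strongly recommend a proper book or online course to learn the details systematically

# A list stores elements of any type; compared with a tuple, a list is more flexible because elements can be added and removed freely
list_example = ["a",  "b"]
# Read the first element; in Python the first element has index 0
list_example[0]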

## 'a'

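R

# For basic syntax like this, I strongly recommend a proper book or online course to learn the details systematically

# c() creates a vector, the closest everyday R counterpart to a Python list of same-typed values
list_example <- c("a",  "b")

# Read the first element; in R the first element has index 1
list_example[1]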

## [1] "a"

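Dictionaries

Python

# A dictionary stores paired elements: each key maps to exactly one value, and a value can be a string, a list, a tuple, and so on. Dictionaries are handy for entering data by hand.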

dict_example = {"x" : ["a", "b", "c"],
                "y" : ["d", "e", "f"]}
dict_example.keys()

## dict_keys(['x', 'y'])

R has no built-in dictionary type; a named list is the closest equivalent.
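Since dictionaries are mentioned as a way to enter data by hand, here is a minimal sketch of turning one into a table. It reuses the `dict_example` from above; the variable name `table` is just an illustrative choice.

```python
import pandas as pd

# Each key becomes a column name, each list becomes that column's values
dict_example = {"x": ["a", "b", "c"],
                "y": ["d", "e", "f"]}

table = pd.DataFrame(dict_example)
print(table)
```

All lists must have the same length, otherwise pandas raises an error when building the DataFrame.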

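Functions

Python

# Defining your own functions matters: it removes repeated steps and keeps code shorter and easier to maintain
# Running the last line passes the numbers 1 and 2 into the new function as arguments; the function computes with them and returns one output value

def sum_function(param1, param2):    # def function_name(parameter 1, parameter 2):
  sum_number = param1 + param2       #   variable = parameter 1 + parameter 2
  return sum_number                  #   return variable

sum_function(1, 2)                   # function_name(number 1, number 2)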

## 3

R

# R functions work much like Python's; the value of the last expression is returned

sum_function <- function(param1, param2) {
  sum_number <- param1 + param2
  sum_number
}

sum_function(1, 2)

## [1] 3

Control-flow statements such as if / elif / else, while, for, with, and try / except are left for you to study on your own.
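As a quick orientation before studying these statements properly, the following is a minimal sketch that combines if / elif / else, a for loop, and try / except. The function name `describe_price`, the thresholds, and the sample inputs are illustrative assumptions, not part of the tutorial's data.

```python
def describe_price(price):
    """Classify a price with if / elif / else (thresholds are illustrative)."""
    if price > 2000:
        return "high"
    elif price > 1000:
        return "medium"
    else:
        return "low"

labels = []
for raw in ["326", "1500", "abc", "18823"]:   # for: loop over each raw string
    try:
        labels.append(describe_price(int(raw)))
    except ValueError:                        # int("abc") raises ValueError
        labels.append("invalid")

print(labels)
```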

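Data processing

The examples here mainly use the diamonds dataset that ships with R's ggplot2 package.

Pipes

Both Python's dfply package and R's dplyr package provide this feature.
A pipe simplifies a data-processing workflow: an operator passes the data into the next function for processing, so steps chain together without intermediate variables and the code stays short.
The operators differ slightly: Python (dfply) uses >>, R uses %>%. Both appear often below. If you are curious, magrittr also offers %T>%, %<>%, and %$%.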
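dfply's `>>` is one way to pipe in Python. Plain pandas can express the same idea through method chaining and `DataFrame.pipe`, which is sketched below on a tiny made-up frame. The helper `add_price_k` and the column `price_k` are hypothetical names used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"cut": ["Good", "Ideal", "Good"],
                   "price": [327, 326, 335]})

def add_price_k(frame):
    """Hypothetical helper: add the price expressed in thousands."""
    return frame.assign(price_k=frame["price"] / 1000)

# Each step receives the result of the previous one, like a pipe
result = (
    df[df["cut"] == "Good"]                 # filter rows
    .pipe(add_price_k)                      # apply a custom function
    .sort_values("price", ascending=False)  # sort descending
    .reset_index(drop=True)
)
print(result)
```

Wrapping the chain in parentheses lets each step sit on its own line, which reads much like a `%>%` pipeline in R.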

Filtering rows

Python

# query() or boolean indexing would work as well
# Select rows where cut is "Good" and price is below 337
r.diamonds >> mask(X.cut == "Good", X.price < 337)

##    carat   cut color clarity  depth  table  price     x     y     z
## 2   0.23  Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
## 4   0.31  Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75

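R

# Filter on cut only and show the first 3 rows (add price < 337 inside filter() to match the Python example)
diamonds %>% filter(cut == "Good") %>% head(3)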

## # A tibble: 3 x 10
##   carat cut   color clarity depth table price     x     y     z
##   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
## 2  0.31 Good  J     SI2      63.3    58   335  4.34  4.35  2.75
## 3  0.3  Good  J     SI1      64      55   339  4.25  4.28  2.73
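The Python comment above mentions `query()` and boolean indexing as alternatives to dfply's `mask`. A minimal sketch of both, using a toy frame that copies the cut and price values of the first five diamonds rows (the name `toy` is an assumption for illustration):

```python
import pandas as pd

# cut and price of the first five rows of diamonds
toy = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good"],
    "price": [326, 326, 327, 334, 335],
})

# Boolean indexing: combine conditions with & and wrap each in parentheses
by_mask = toy[(toy["cut"] == "Good") & (toy["price"] < 337)]

# query(): the same filter written as a string expression
by_query = toy.query("cut == 'Good' and price < 337")

print(by_mask)
```

Both forms keep the original row labels (2 and 4 here), just like the dfply output above.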

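Selecting columns

Python

diamonds = r.diamonds  # copy of the R data frame as a pandas DataFrame
# Plain pandas: diamonds[["cut", "color"]]
# Columns starting with "c": diamonds >> select(starts_with("c")) >> head(3)
diamonds >> select("cut", "color") >> head(3)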

##        cut color
## 0    Ideal     E
## 1  Premium     E
## 2     Good     E

R

# Columns starting with "c": diamonds %>% select(starts_with("c")) %>% head(3)
diamonds %>% select(cut, color) %>% head(3)

## # A tibble: 3 x 2
##   cut     color
##   <ord>   <ord>
## 1 Ideal   E    
## 2 Premium E    
## 3 Good    E

Sampling

Python

# Randomly sample 3 rows (the trailing head(3) is redundant but harmless)
diamonds.sample(3) >> head(3)

##        carat      cut color clarity  depth  table  price     x     y     z
## 40767    0.4    Ideal     E    VVS2   62.4   59.0   1166  4.70  4.73  2.94
## 31983    0.3    Ideal     F     VS2   62.7   57.0    776  4.31  4.27  2.69
## 29847    0.3  Premium     D     VS2   61.9   58.0    710  4.28  4.32  2.66

R

# In R, sample_n() draws random rows; sample() on a data frame draws random columns
diamonds %>% sample(3) %>% head(3)

## # A tibble: 3 x 3
##   clarity cut         x
##   <ord>   <ord>   <dbl>
## 1 SI2     Ideal    3.95
## 2 SI1     Premium  3.89
## 3 VS1     Good     4.05

Dropping columns

Python

# Drop the color and table columns
diamonds.drop(columns=["color", "table"]).head(3)

##    carat      cut clarity  depth  price     x     y     z
## 0   0.23    Ideal     SI2   61.5    326  3.95  3.98  2.43
## 1   0.21  Premium     SI1   59.8    326  3.89  3.84  2.31
## 2   0.23     Good     VS1   56.9    327  4.05  4.07  2.31

R

# A leading - inverts the selection, so deselecting columns drops them
diamonds %>% select(-color, -table) %>% head(3)

## # A tibble: 3 x 8
##   carat cut     clarity depth price     x     y     z
##   <dbl> <ord>   <ord>   <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   SI2      61.5   326  3.95  3.98  2.43
## 2  0.21 Premium SI1      59.8   326  3.89  3.84  2.31
## 3  0.23 Good    VS1      56.9   327  4.05  4.07  2.31

Adding computed columns

Python

# Add computed columns x + y and x * y * z
# transmute keeps only the computed columns; mutate keeps all columns
# diamonds >> transmute(x_plus_y=X.x + X.y, xyz=X.x*X.y*X.z) >> head(3)
diamonds >> mutate(x_plus_y=X.x + X.y, xyz=X.x*X.y*X.z) >> head(3)

##    carat      cut color clarity  depth  ...     x     y     z  x_plus_y        xyz
## 0   0.23    Ideal     E     SI2   61.5  ...  3.95  3.98  2.43      7.93  38.202030
## 1   0.21  Premium     E     SI1   59.8  ...  3.89  3.84  2.31      7.73  34.505856
## 2   0.23     Good     E     VS1   56.9  ...  4.05  4.07  2.31      8.12  38.076885
## 
## [3 rows x 12 columns]

R

# To keep all columns: diamonds %>% mutate(x_plus_y = x + y, xyz = x*y*z)
diamonds %>% transmute(x_plus_y = x + y, xyz = x*y*z) %>% head(3)

## # A tibble: 3 x 2
##   x_plus_y   xyz
##      <dbl> <dbl>
## 1     7.93  38.2
## 2     7.73  34.5
## 3     8.12  38.1

Mapping values to levels

Python

# Write a classification function
def cal_level(p):
  if p > 2000:
    return "A"
  elif 1500 < p <= 2000:
    return "B"
  elif 1000 < p <= 1500:
    return "C"
  else:
    return "D"

new_diamonds = r.diamonds

# map() feeds every price in the price column through the function above, which returns a level; the results form a new computed column
new_diamonds["price_level"] = new_diamonds.price.map(cal_level) 
new_diamonds.head(3)

##    carat      cut color clarity  depth  ...  price     x     y     z  price_level
## 0   0.23    Ideal     E     SI2   61.5  ...    326  3.95  3.98  2.43            D
## 1   0.21  Premium     E     SI1   59.8  ...    326  3.89  3.84  2.31            D
## 2   0.23     Good     E     VS1   56.9  ...    327  4.05  4.07  2.31            D
## 
## [3 rows x 11 columns]

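R

# between() is closed on both ends [a, b] and case_when() takes the first match,
# so 1500 maps to 'B' and 1000 to 'C' here, whereas the Python function returns 'C' and 'D'
new_df <- diamonds %>% mutate(price_class = case_when(price > 2000 ~ 'A',
                                            between(price, 1500, 2000) ~ 'B',
                                            between(price, 1000, 1500) ~ 'C',
                                            TRUE ~ 'D'))
new_df %>% head(3)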

## # A tibble: 3 x 11
##   carat cut     color clarity depth table price     x     y     z price_class
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>      
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43 D          
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31 D          
## 3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31 D
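For purely numeric binning like this, pandas also offers `pd.cut`, which avoids writing the if / elif chain by hand. The sketch below is an alternative to the `cal_level` function above, not the approach the tutorial uses; the sample prices are chosen to sit on the bin edges.

```python
import pandas as pd

prices = pd.Series([326, 1000, 1500, 2000, 2001])

# Bins are right-closed by default: (-inf, 1000], (1000, 1500], (1500, 2000], (2000, inf)
# which matches the boundaries used by cal_level
levels = pd.cut(
    prices,
    bins=[float("-inf"), 1000, 1500, 2000, float("inf")],
    labels=["D", "C", "B", "A"],
)
print(list(levels))
```

Because the bins are closed on the right, a price of exactly 1500 lands in "C" and 2000 in "B", the same as the Python function.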

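Binding rows / columns

Python

# Tables collected in a list can be bound together into one new table
diamonds1 = diamonds.head(2)
diamonds2 = diamonds.tail(3)
diamonds3 = diamonds.sample(3)
pd.concat([diamonds1, diamonds2, diamonds3], axis=0) # axis=0 stacks rows; axis=1 binds columns side by side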

##        carat        cut color clarity  depth  table  price     x     y     z
## 0       0.23      Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 1       0.21    Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
## 53937   0.70  Very Good     D     SI1   62.8   60.0   2757  5.66  5.68  3.56
## 53938   0.86    Premium     H     SI2   61.0   58.0   2757  6.15  6.12  3.74
## 53939   0.75      Ideal     D     SI2   62.2   55.0   2757  5.83  5.87  3.64
## 3642    0.90  Very Good     G     SI2   64.2   60.0   3437  6.02  6.09  3.89
## 34992   0.35      Ideal     E    VVS2   61.3   56.0    881  4.54  4.60  2.80
## 20848   1.50       Good     E     SI2   63.9   58.0   9072  7.26  7.21  4.62

R

diamonds1 <- diamonds %>% select(carat, cut, color)
diamonds2 <- diamonds %>% select(x, y, z)
# bind_cols() binds columns side by side; bind_rows() stacks rows like pd.concat(axis=0)
diamonds1 %>% bind_cols(diamonds2) %>% head(3)

## # A tibble: 3 x 6
##   carat cut     color     x     y     z
##   <dbl> <ord>   <ord> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   E      3.95  3.98  2.43
## 2  0.21 Premium E      3.89  3.84  2.31
## 3  0.23 Good    E      4.05  4.07  2.31

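Joining tables

Python

# This works much like SQL; Google the differences between the join types. If you get the chance, learn SQL too, it is another very important language

a = pd.DataFrame({
  'x1':['A','B','C'],
  'x2':[1,2,3]
}) # create a DataFrame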
a

##   x1  x2
## 0  A   1
## 1  B   2
## 2  C   3

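b = pd.DataFrame({
  'x1':['A','B','D'],
  'x3':[True,False,True]
}) # create a second DataFrame
b
# Inner join alternative: a.merge(b, how="inner", on='x1')
# The left join that follows keeps every row of a, matches on column x1, and pulls in column x3 from b. Rows of b with no match in a are dropped; rows of a with no match in b get NaN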

##   x1     x3
## 0  A   True
## 1  B  False
## 2  D   True

a.merge(b, how="left", on='x1')

##   x1  x2     x3
## 0  A   1   True
## 1  B   2  False
## 2  C   3    NaN

R

# right_join(py$a, py$b, by = 'x1')
inner_join(py$a, py$b, by='x1')

##   x1 x2    x3
## 1  A  1  TRUE
## 2  B  2 FALSE
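To see how the join types differ without reading up on SQL first, the sketch below runs all four on the same `a` and `b` tables and uses pandas' `indicator=True` option, which adds a `_merge` column saying where each row came from. The dictionary name `joins` is an illustrative choice.

```python
import pandas as pd

a = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
b = pd.DataFrame({"x1": ["A", "B", "D"], "x3": [True, False, True]})

# Run each join type; _merge is "both", "left_only", or "right_only"
joins = {
    how: a.merge(b, how=how, on="x1", indicator=True)
    for how in ["inner", "left", "right", "outer"]
}

for how, frame in joins.items():
    print(how, list(frame["x1"]), list(frame["_merge"]))
```

The inner join keeps only keys present in both tables (A, B), left adds C, right adds D, and outer keeps all four.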

Sorting

Python

# Descending here; the default is ascending
diamonds.sort_values("price", ascending=False).head(3)

##        carat        cut color clarity  depth  table  price     x     y     z
## 27749   2.29    Premium     I     VS2   60.8   60.0  18823  8.50  8.47  5.16
## 27748   2.00  Very Good     G     SI1   63.5   56.0  18818  7.90  7.97  5.04
## 27747   1.51      Ideal     G      IF   61.7   55.0  18806  7.37  7.41  4.56

R

# Ascending (the default); wrap the column in desc() for descending order
diamonds %>% arrange(price) %>% head(3)

## # A tibble: 3 x 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31

diamonds %>% arrange(desc(price)) %>% head(3)

## # A tibble: 3 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
## 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
## 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56

Renaming columns

Python

# Rename column cut to CUT
# Plain pandas: diamonds.rename(columns={"cut":"CUT", "price":"PRICE"}).head(3)
diamonds >> rename(CUT = X.cut) >> head(2)

##    carat      CUT color clarity  depth  table  price     x     y     z
## 0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
## 1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31

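R

# Call dplyr::rename explicitly: other loaded packages (for example plyr, which Rmisc loads) define a rename() that masks dplyr's
diamonds %>% dplyr::rename(CUT = cut) %>% head(2)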

## # A tibble: 2 x 10
##   carat CUT     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31

Ranking

Python

# Three ranking methods: row_number, min_rank, dense_rank
diamonds = r.diamonds
# diamonds >> mutate(price_rank = row_number(X.price)) # unique ranks; ties are broken by row order
# diamonds >> mutate(price_rank = min_rank(X.price)) # ties share the lowest rank and leave gaps after them
rank = diamonds >> mutate(price_rank = dense_rank(X.price)) # ties share a rank and no gaps are left
rank.price_rank = rank.price_rank.astype(int)
rank.head(3)

##    carat      cut color clarity  depth  ...  price     x     y     z  price_rank
## 0   0.23    Ideal     E     SI2   61.5  ...    326  3.95  3.98  2.43           1
## 1   0.21  Premium     E     SI1   59.8  ...    326  3.89  3.84  2.31           1
## 2   0.23     Good     E     VS1   56.9  ...    327  4.05  4.07  2.31           2
## 
## [3 rows x 11 columns]

R

# diamonds %>% mutate(price_rank = min_rank(price))
# diamonds %>% mutate(price_rank = dense_rank(price))
diamonds %>% mutate(price_rank = row_number(price)) %>% head(3)

## # A tibble: 3 x 11
##   carat cut     color clarity depth table price     x     y     z price_rank
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>      <int>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43          1
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31          2
## 3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31          3
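The three ranking methods differ only in how they treat ties. As a hedged sketch in plain pandas (rather than dfply), `Series.rank` with `method="first"`, `"min"`, and `"dense"` gives the counterparts of `row_number`, `min_rank`, and `dense_rank`. The four prices are the first diamonds prices, which include a tie.

```python
import pandas as pd

prices = pd.Series([326, 326, 327, 334])

ranks = pd.DataFrame({
    "price": prices,
    "row_number": prices.rank(method="first").astype(int),  # ties broken by order
    "min_rank": prices.rank(method="min").astype(int),      # ties share lowest rank, gap follows
    "dense_rank": prices.rank(method="dense").astype(int),  # ties share rank, no gap
})
print(ranks)
```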

Window functions

Python

# Shift one column of the data down or up by a number of rows
diamonds >> mutate(lag_price = X.price.shift(4)) >> head(5)

##    carat      cut color clarity  depth  ...  price     x     y     z  lag_price
## 0   0.23    Ideal     E     SI2   61.5  ...    326  3.95  3.98  2.43        NaN
## 1   0.21  Premium     E     SI1   59.8  ...    326  3.89  3.84  2.31        NaN
## 2   0.23     Good     E     VS1   56.9  ...    327  4.05  4.07  2.31        NaN
## 3   0.29  Premium     I     VS2   62.4  ...    334  4.20  4.23  2.63        NaN
## 4   0.31     Good     J     SI2   63.3  ...    335  4.34  4.35  2.75      326.0
## 
## [5 rows x 11 columns]

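R

# Shift down: lag() moves values to later rows (despite the column name price_lead); lead() shifts the other way
diamonds %>% mutate(price_lead = lag(price,2)) %>% head(3)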

## # A tibble: 3 x 11
##   carat cut     color clarity depth table price     x     y     z price_lead
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>      <int>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43         NA
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31         NA
## 3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31        326

# Cumulative sum
diamonds %>% mutate(cumsum_carat = cumsum(carat)) %>% head(3)

## # A tibble: 3 x 11
##   carat cut     color clarity depth table price     x     y     z cumsum_carat
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>        <dbl>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43         0.23
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31         0.44
## 3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31         0.67

# cummean(), cummax(), cummin(), and cumprod() are also available
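The R window functions above have direct pandas counterparts. The sketch below shows them side by side on the first five diamonds prices; the column names are illustrative, and `expanding().mean()` is used as the pandas stand-in for `cummean()`.

```python
import pandas as pd

s = pd.Series([326, 326, 327, 334, 335], name="price")

out = pd.DataFrame({
    "price": s,
    "lag_2": s.shift(2),              # like dplyr lag(price, 2): values move down
    "lead_2": s.shift(-2),            # like dplyr lead(price, 2): values move up
    "cumsum": s.cumsum(),             # running total
    "cummean": s.expanding().mean(),  # running mean
    "cummax": s.cummax(),             # running maximum
})
print(out)
```

Shifted positions with no source value become NaN, which is why the shifted columns are floats.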

Long to wide

Python

# Spread the information out into columns
long_data = pd.DataFrame({
  'Player':['Player1']*3 + ['Player2']*3 + ['Player3']*3,
  'Introduction':['name','education', 'sex']*3,
  'Message': ['Sulie', 'master', 'male', 'LuBan', 'Bachelor', 'male', 'ZhenJi', 'PhD', 'female']
})
long_data

##     Player Introduction   Message
## 0  Player1         name     Sulie
## 1  Player1    education    master
## 2  Player1          sex      male
## 3  Player2         name     LuBan
## 4  Player2    education  Bachelor
## 5  Player2          sex      male
## 6  Player3         name    ZhenJi
## 7  Player3    education       PhD
## 8  Player3          sex    female

long_data >> spread(X.Introduction, X.Message)

##     Player education    name     sex
## 0  Player1    master   Sulie    male
## 1  Player2  Bachelor   LuBan    male
## 2  Player3       PhD  ZhenJi  female

R

short_data <- py$long_data %>% pivot_wider(id_cols = Player, names_from = Introduction,
                             values_from = Message)
short_data

## # A tibble: 3 x 4
##   Player  name   education sex   
##   <chr>   <chr>  <chr>     <chr> 
## 1 Player1 Sulie  master    male  
## 2 Player2 LuBan  Bachelor  male  
## 3 Player3 ZhenJi PhD       female

Wide to long

Python

# Gather the information back into key / value pairs
r.short_data >> gather('Introduction', 'Message', ['name', 'sex', 'education']) >> head(3)

##     Player Introduction Message
## 0  Player1         name   Sulie
## 1  Player2         name   LuBan
## 2  Player3         name  ZhenJi

R

short_data %>% pivot_longer(-Player, names_to = 'Introduction',
                              values_to = 'Message') # note the - sign: pivot every column except Player

## # A tibble: 9 x 3
##   Player  Introduction Message 
##   <chr>   <chr>        <chr>   
## 1 Player1 name         Sulie   
## 2 Player1 education    master  
## 3 Player1 sex          male    
## 4 Player2 name         LuBan   
## 5 Player2 education    Bachelor
## 6 Player2 sex          male    
## 7 Player3 name         ZhenJi  
## 8 Player3 education    PhD     
## 9 Player3 sex          female
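The same round trip can be done in plain pandas with `pivot` (long to wide) and `melt` (wide to long). This is a sketch of an alternative to dfply's `spread` and `gather`, reusing the `long_data` table from above; `wide` and `back_to_long` are illustrative names.

```python
import pandas as pd

long_data = pd.DataFrame({
    "Player": ["Player1"] * 3 + ["Player2"] * 3 + ["Player3"] * 3,
    "Introduction": ["name", "education", "sex"] * 3,
    "Message": ["Sulie", "master", "male", "LuBan", "Bachelor", "male",
                "ZhenJi", "PhD", "female"],
})

# Long to wide: one row per Player, one column per Introduction value
wide = long_data.pivot(index="Player", columns="Introduction",
                       values="Message").reset_index()
wide.columns.name = None  # drop the leftover "Introduction" axis label

# Wide to long: every column except Player becomes a key / value pair
back_to_long = wide.melt(id_vars="Player", var_name="Introduction",
                         value_name="Message")
print(wide)
```

`pivot` requires each Player / Introduction pair to be unique; with duplicates, `pivot_table` and an aggregation function would be needed instead.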

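Filling missing values

Python

# fillna() with a single value for the whole column also works, but an aggregate over the entire column is imprecise. Filling each gap with the mean of its own group, as done here, is more accurate
# Replace one value with NaN to simulate missing / abnormal data
diamonds_use = diamonds.tail(100)[["cut", "price"]].replace(2757, np.nan)
diamonds_use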

##              cut   price
## 53840       Good  2738.0
## 53841  Very Good  2739.0
## 53842      Ideal  2739.0
## 53843      Ideal  2739.0
## 53844  Very Good  2740.0
## ...          ...     ...
## 53935      Ideal     NaN
## 53936       Good     NaN
## 53937  Very Good     NaN
## 53938    Premium     NaN
## 53939      Ideal     NaN
## 
## [100 rows x 2 columns]

mean_price_groupby = diamonds_use.groupby('cut').mean() # mean price per cut
# A pivot table works too
# mean_price_pivot = diamonds_use.pivot_table(values='price', index='cut')

def impute_price(row):
  # row is one row of the two-column frame; access its values by label
  if pd.isnull(row["price"]):
    return mean_price_groupby.loc[row["cut"], "price"]
    # A pivot table works too
    # return mean_price_pivot.loc[row["cut"], "price"]
  else:
    return row["price"]

diamonds_use.price = diamonds_use[["cut", "price"]].apply(impute_price, axis=1)
diamonds_use

##              cut        price
## 53840       Good  2738.000000
## 53841  Very Good  2739.000000
## 53842      Ideal  2739.000000
## 53843      Ideal  2739.000000
## 53844  Very Good  2740.000000
## ...          ...          ...
## 53935      Ideal  2747.838710
## 53936       Good  2750.333333
## 53937  Very Good  2747.920000
## 53938    Premium  2748.125000
## 53939      Ideal  2747.838710
## 
## [100 rows x 2 columns]
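The row-by-row `apply` above makes the logic explicit. A more compact way to express the same group-mean imputation, sketched here under the assumption of a small made-up frame, is `groupby(...).transform("mean")` combined with `fillna`. The column name `price_filled` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cut": ["Good", "Good", "Ideal", "Ideal", "Ideal"],
    "price": [2738.0, np.nan, 2739.0, 2741.0, np.nan],
})

# transform("mean") returns the group mean aligned to every original row
group_mean = df.groupby("cut")["price"].transform("mean")

# fillna with a Series fills each gap with the value at the same index
df["price_filled"] = df["price"].fillna(group_mean)
print(df)
```

The mean ignores NaN by default, so each gap is filled from the observed prices of its own cut only.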

String processing

Python (R is skipped here, because string handling is one of Python's strengths)

Common built-in string methods

text = "  Item quantity " 
text

## '  Item quantity '

text = text.strip() # strip whitespace from both ends
text

## 'Item quantity'

text = text.replace(" ", "_") # replace every occurrence of one substring with another
text

## 'Item_quantity'

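diamonds.columns = diamonds.columns.map(lambda x: x.title()) # map() applies the function to every column name, capitalizing its first letter
diamonds.head(1)  # upper(): all upper case; lower(): all lower case; title(): capitalize the first letter of each word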

##    Carat    Cut Color Clarity  Depth  Table  Price     X     Y     Z
## 0   0.23  Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43

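Regular expressions with the re module
re is one of the core modules for data cleaning, and many web-scraping scripts depend on it.
Regular expressions are fairly complex, so I will not go into detail here.
If you are interested, follow the link below. Once you understand them they are useful in many places, and regular expressions in R and SQL are similar.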
https://www.liaoxuefeng.com/wiki/1016959663602400/1017639890281664

text = "tutorial*%qihaoxiang1@Python3-*R6"
re.findall(r'\w{6,10}\d{1}', text) # returns the matches as a list: 6 to 10 word characters followed by one digit

## ['qihaoxiang1', 'Python3']

new_diamonds = diamonds.head(5)
new_diamonds["Cut"].map(lambda x: str(x) + "123") # append digits to simulate dirty data (the result is displayed, not assigned back)

## 0      Ideal123
## 1    Premium123
## 2       Good123
## 3    Premium123
## 4       Good123
## Name: Cut, dtype: category
## Categories (5, object): [Fair123 < Good123 < Very Good123 < Premium123 < Ideal123]

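new_diamonds["Cut"].map(lambda x: re.split(r'\d+', x)[0]) # strip digits by keeping the text before the first run of digits; to clean the suffixed values, store the previous result in a variable and map over that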

## 0      Ideal
## 1    Premium
## 2       Good
## 3    Premium
## 4       Good
## Name: Cut, dtype: category
## Categories (5, object): [Fair < Good < Very Good < Premium < Ideal]
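The split-and-take-first idea above can also be written with `re.sub`, which replaces the matched digits with an empty string. The sketch below works on a plain list so that it runs without the diamonds data; `dirty` and `clean` are illustrative names.

```python
import re

# Values with a numeric suffix, simulating dirty category labels
dirty = ["Ideal123", "Premium123", "Very Good123"]

# \d+$ matches one or more digits at the end of the string
clean = [re.sub(r"\d+$", "", value) for value in dirty]
print(clean)
```

Anchoring the pattern with `$` removes only trailing digits, so a label with digits in the middle would keep them.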

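Summary:

My learning sequence: get a rough idea of what the tools can do >> buy a book or online course to learn the basic syntax >> practice hands-on >> learn third-party packages alongside their official documentation >> summarize and organize what I learned into a body of knowledge >> try applying new knowledge to produce analysis reports >> read excellent analysis reports and code on GitHub and in articles

This is only a rough order; the details vary from person to person. Once you are fluent with the packages above, you can move on to a wider range of third-party packages for machine learning, deep learning, web scraping, all kinds of charts, database connections, and other more advanced work.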

pandas official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/index.html
ggplot2 official documentation: https://ggplot2.tidyverse.org/reference/index.html
dplyr official documentation: https://dplyr.tidyverse.org/reference/index.html

Appendix: Visual Analysis of nCoV Outbreak Data (Hands-On with R in One Hand and Python in the Other)

Author: Haoxiang Qi（齐浩翔）
Date: Feb 8, 2020
Data from: GuangchuangYu (Github), jianxu305 (Github)

Getting data with an R package

# R
# Requires the remotes package to be installed first
library(remotes)
# remotes::install_github("GuangchuangYu/nCov2019") # install the data package from GitHub (from GuangchuangYu)
library(nCov2019)

# Fetch the data
x <- get_nCov2019()
x[] %>% head(3)

##   name confirm suspect dead heal
## 1 湖北   27100       0  780 1439
## 2 广东    1120       0    1  125
## 3 浙江    1075       0    0  173

summary <- summary(x)
summary %>% head(3)

##   confirm suspect dead heal deadRate healRate  date
## 1      41       0    1    0      2.4      0.0 01.13
## 2      41       0    1    0      2.4      0.0 01.14
## 3      41       0    2    5      4.9     12.2 01.15

summary_new = summary(x, by='today')

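Nationwide cumulative totals trend chart

Data processing

# Python

# Define a function that reshapes the table to long format, tagging each row with its ticker (condition)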
def convert_table(table):
  output_table = pd.DataFrame()
  for ticker in ["heal", "dead", "suspect", "confirm"]:
    col = table >> select(ticker, "date") >> mutate(condition = ticker)
    output_table = pd.concat([output_table, col.rename(columns={ticker:"number"})])
  return output_table.replace(["heal", "dead", "suspect", "confirm"],
  ["治愈", "死亡", "疑似", "确诊"]).reset_index().drop(columns="index")

# Apply the function to add the ticker
summary = convert_table(r.summary)
summary.head(3)

##    number   date condition
## 0       0  01.13        治愈
## 1       0  01.14        治愈
## 2       5  01.15        治愈
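The loop in `convert_table` is a wide-to-long reshape, so it can also be sketched with pandas `melt`. This is an alternative formulation, not the author's function; it uses the three summary rows printed earlier in this appendix, and `summary_wide`, `labels`, and `summary_long` are illustrative names.

```python
import pandas as pd

# First three rows of the R summary table shown earlier in this appendix
summary_wide = pd.DataFrame({
    "confirm": [41, 41, 41],
    "suspect": [0, 0, 0],
    "dead": [1, 1, 2],
    "heal": [0, 0, 5],
    "date": ["01.13", "01.14", "01.15"],
})

# Same order and Chinese labels as in convert_table
labels = {"heal": "治愈", "dead": "死亡", "suspect": "疑似", "confirm": "确诊"}

summary_long = summary_wide.melt(
    id_vars="date",
    value_vars=list(labels),   # heal, dead, suspect, confirm
    var_name="condition",
    value_name="number",
)
summary_long["condition"] = summary_long["condition"].map(labels)
summary_long = summary_long[["number", "date", "condition"]]
print(summary_long.head(3))
```

`melt` stacks the value columns in the order given by `value_vars`, so the recovered ("治愈") rows come first, as in the output above.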

Visualization

# R
ggplot(py$summary, aes(x=as.Date(date, "%m.%d"), y=number, color=condition)) + # remember to convert the date from str to Date
  geom_line(size=.8, alpha=.7) +
  geom_point(aes(shape=condition)) +
  geom_hline(yintercept = 10000, color='black', size=.8, alpha=.5, linetype="dotted") +
  geom_hline(yintercept = 20000, color='black', size=.8, alpha=.5, linetype="dotted") +
  geom_hline(yintercept = 30000, color='black', size=.8, alpha=.5, linetype="dotted") +
  theme_classic() +
  scale_x_date(date_labels = "%m/%d", breaks = "2 days") +
  xlab("日期") +
  ylab("人数") +
  ggtitle("全国累计总人数趋势图") +
  theme(text = element_text(family = "Source Han Sans CN"))

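Analysis:
1. Confirmed and suspected cases are clearly still rising. Confirmed cases show roughly exponential growth and are expected to pass 40,000 on February 9, so the situation remains serious.
2. Recoveries nationwide are rising gradually and have passed 1,542, an early sign that the supplies and medical support sent from around the country are taking effect.

If you are interested, the link below shows how to forecast the turning point with R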
https://openr.pzhao.org/zh/blog/ncovr-03/

Nationwide daily new cases trend chart

Data processing

# Python
# Apply the function to add the ticker
summary_new = convert_table(r.summary_new)
summary_new.sample(3)

##     number   date condition
## 35      65  02.04        死亡
## 26      24  01.26        死亡
## 65     688  01.25        确诊

Visualization

# R
ggplot(py$summary_new, aes(x=as.Date(date, "%m.%d"), y=number, color=condition)) +
  geom_line(size=.8, alpha=.7) +
  geom_point(aes(shape=condition)) +
  geom_hline(yintercept = 2000, color='black', size=.8, alpha=.5, linetype="dotted") +
  geom_hline(yintercept = 4000, color='black', size=.8, alpha=.5, linetype="dotted") +
  geom_hline(yintercept = 6000, color='black', size=.8, alpha=.5, linetype="dotted") +
  theme_classic() +
  scale_x_date(date_labels = "%m/%d", breaks = "2 days") +
  xlab("日期") +
  ylab("人数") +
  ggtitle("全国每日新增趋势图") +
  theme(text = element_text(family = "Source Han Sans CN"))

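Analysis:
1. New suspected cases have started to fall in recent days, while new confirmed cases have not risen sharply. One possible reason is that the diagnosis process has matured; another is that this is an early sign of the outbreak coming under control.
2. Thanks to the tireless work of large numbers of medical staff, new recoveries have begun to climb quickly.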

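Nationwide confirmed / suspected share (100% stacked bar chart)

Data processing

# Python
# Keep only suspected and confirmed rows (named filtered to avoid shadowing the built-in filter)
filtered = summary[summary.condition.isin(["疑似", "确诊"])].sort_values("date")

# Sum the counts by date and rename the column
total = filtered.groupby("date")["number"].sum().reset_index().rename(columns={"number":"total"})

# Join the tables and add a computed ratio column
ratio_table = filtered.merge(total, how="left", on="date") >> mutate(ratio=X.number / X.total)
ratio_table.tail(3)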

##     number   date condition  total     ratio
## 51   27657  02.07        疑似  62255  0.444253
## 52   28942  02.08        疑似  66193  0.437237
## 53   37251  02.08        确诊  66193  0.562763
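The aggregate-then-merge pattern above can be shortened with `groupby(...).transform("sum")`, which puts each date's total directly on every row. The sketch below is an alternative under that assumption, using the last two dates from the output above; the confirmed count for 02.07 is derived as the printed total minus the suspected count (62255 - 27657).

```python
import pandas as pd

counts = pd.DataFrame({
    "date": ["02.07", "02.07", "02.08", "02.08"],
    "condition": ["疑似", "确诊", "疑似", "确诊"],
    # 34598 is derived from the output above: 62255 - 27657
    "number": [27657, 34598, 28942, 37251],
})

# Per-date total aligned to each row, then each row's share of that total
counts["total"] = counts.groupby("date")["number"].transform("sum")
counts["ratio"] = counts["number"] / counts["total"]
print(counts)
```

Within each date the ratios sum to 1, which is what the stacked percentage chart relies on.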

Visualization

# R
ggplot(py$ratio_table %>% tail(-14), aes(x=as.Date(date, "%m.%d"), y=ratio, fill=condition)) +
  geom_bar(stat="identity", alpha=.8) +
  geom_hline(yintercept = 0.5, color='black', size=.8, alpha=.7, linetype="dotted") +
  xlab("日期") +
  ylab("占比") +
  ggtitle("全国确诊/疑似百分比堆积图") +
  theme_classic() +
  scale_x_date(date_labels = "%m/%d", breaks = "2 days") +
  theme(text = element_text(family = "Source Han Sans CN")) +
  geom_text(aes(label=paste0(sprintf("%1.0f", ratio*100), "%")), position=position_stack(vjust=0.5), size=2.5)

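Analysis:
1. The confirmed share has risen in recent days and passed 50% on February 4, which suggests diagnostic capacity has improved and infected people can be identified and contained more easily.
2. Better diagnostic capacity means more people in the incubation period can be isolated, reducing contact with uninfected people and lowering the chance of the outbreak spreading.

The R package offers relatively little data, so Python is used below to get more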

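# Python
# Someone has already built the wheel for us; this is one of Python's advantages, you can reuse wheels built all over the world
import utils # a module that downloads data from the data warehouse and cleans it, from jianxu305 (GitHub)

# Load the data
cnov_data = utils.load_chinese_data()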

## 最近更新于:  2020-02-09 10:23:01.835000
## 数据日期范围:  2020-01-24 to 2020-02-09
## 数据条目数:  26724

cnov_data = utils.aggDaily(cnov_data, clean_data=True) 

# Convert the date to a string (otherwise passing it to R raises an error)
cnov_data.updateDate = cnov_data.updateDate.astype(str) 
cnov_data.drop(columns="updateTime", inplace=True)
cnov_data.head(3)

##       provinceName cityName  confirmed  cured  dead  updateDate
## 25734          云南省      丽江市          1      0     0  2020-01-24
## 25732          云南省       昆明          3      0     0  2020-01-24
## 25733          云南省    西双版纳州          1      0     0  2020-01-24

Population-GDP-deaths bubble chart (top ten cities by deaths, excluding Wuhan)

Data processing

# Python
# Keep only the most recent date
sort_data = cnov_data[cnov_data.updateDate == cnov_data.updateDate.max()]

# Top 10 cities excluding Wuhan: take the top 11 by deaths and drop the first row (Wuhan)
sort_data = sort_data.sort_values("dead", ascending=False).head(11).tail(10)

# Add GDP and population data by hand
extra_data = pd.DataFrame({"cityName":list(sort_data.cityName), 
                    "GDP":[2035, 1912, 1005, 1847, 591, 2082, 1011, 4064, 800, 4309],
                    "population":[750, 530, 108, 300, 161, 641, 222, 417, 156, 605]})

# Join the data
sort_data = sort_data.merge(extra_data, how="inner", on="cityName")
sort_data.head(3)

##   provinceName cityName  confirmed  cured  dead  updateDate   GDP  population
## 0          湖北省       黄冈       2141    137    43  2020-02-09  2035         750
## 1          湖北省       孝感       2436     45    29  2020-02-09  1912         530
## 2          湖北省       鄂州        639     42    21  2020-02-09  1005         108

Visualization

# R
ggplot(py$sort_data, aes(x=population, y=GDP, color=cityName)) +
  geom_point(aes(size=dead)) +
  geom_text(aes(label=paste(cityName, dead)), family = "Source Han Sans CN", hjust=-.5, size=3) +
  geom_abline(intercept=0, slope=6.5, color='red', size=.9, alpha=.5, linetype="dotted") +
  xlim(100, 800) +
  theme_classic() +
  xlab("人口  (万)") +
  ylab("GDP  (亿)") +
  ggtitle("人口-GDP-死亡数气泡关系图 (除武汉外死亡数前十城市)") +
  scale_size_continuous(range=c(2,11)) +
  annotate("text", x=530, y=3500, label="全国人均GDP分界线", alpha=.6, family="Source Han Sans CN", size=2.5) +
  annotate("text", x=600, y=1200, label="未达全国人均GDP", alpha=.8, family="Source Han Sans CN", size=4) +
  annotate("text", x=250, y=3000, label="超过全国人均GDP", alpha=.8, family="Source Han Sans CN", size=4) +
  theme(text = element_text(family = "Source Han Sans CN"))

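Analysis:
GDP and population figures for the 10 cities were entered by hand, and a reference line for national GDP per capita was added, to explore the relationship between current wealth and the number of deaths.
1. Xiangyang and Yichang, the second- and third-largest economies in Hubei, have relatively good medical resources, which may be why they keep deaths comparatively low despite their large populations.
2. Judging by deaths, Huanggang and Xiaogan probably have the greater need for medical resources, and relief efforts for these two cities should be stepped up.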

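Cure-rate trend in four provinces (Hubei / Zhejiang / Henan / Guangdong)

Data processing

# Python
# Aggregate by province and date to get province-level data
province_data = cnov_data.groupby(["provinceName", "updateDate"]).sum(numeric_only=True).reset_index()

# Aggregate by date to get nationwide totals; the cure rate computed from them later is the national average rate
china = cnov_data.groupby("updateDate").sum(numeric_only=True).reset_index() >> mutate(provinceName = "全国平均")

# Combine the data, then keep only the four provinces and the national figure
total = pd.concat([province_data, china], sort=False)
total = total.sort_values(["provinceName", "confirmed"], ascending=False)
total = total[total.provinceName.isin(["湖北省", "广东省", "浙江省", "河南省", "全国平均"])]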
total.head(3)

##     provinceName  updateDate  confirmed  cured  dead
## 347          湖北省  2020-02-09      27100   1439   780
## 346          湖北省  2020-02-08      24953   1218   699
## 345          湖北省  2020-02-07      22112    867   618

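Visualization

# R
# Add a computed cure-rate column and keep three columns
total <- py$total %>% mutate(cured_ratio = cured / confirmed * 100) %>% select(
  "provinceName", "updateDate", "cured_ratio")

# Define a function so that all four charts can be drawn in one go
plot_provinces <- function(info, flag_title, flag_title_x, flag_title_y)
  {
  # The flags are passed into theme(): TRUE hides the corresponding title, which removes redundant titles in the grid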
  if (flag_title == TRUE)  
    title = element_blank()
  else
    title = element_text(size = 10, face = "bold")
  if (flag_title_x == TRUE)
    title_x = element_blank()
  else
    title_x = element_text(size = 8)
  if (flag_title_y == TRUE)
    title_y = element_blank()
  else
    title_y = element_text(size = 8)
  if (info == "浙江省")
    {mannul_colors = c("#E69F00", "#56B4E9")
    mannul_shapes = c(16, 17)}
  else
    {mannul_colors = c("#56B4E9", "#E69F00")
    mannul_shapes = c(17, 16)}
  
ggplot(total %>% filter(provinceName == info | provinceName == "全国平均"), aes(
  x=as.Date(updateDate, "%Y-%m-%d"), y=cured_ratio, color=provinceName)) + 
  geom_line(size=.7, alpha=.5) +
  geom_point(aes(shape=provinceName)) +
  theme_classic() +
  scale_x_date(date_labels = "%m/%d", breaks = "3 days") +
  xlab("日期") +
  ylab("治愈率 (%)") +
  geom_hline(yintercept = 5, color='black', size=.5, alpha=.7, linetype="dotted") +
  geom_hline(yintercept = 10, color='black', size=.5, alpha=.7, linetype="dotted") +
  geom_hline(yintercept = 15, color='black', size=.5, alpha=.7, linetype="dotted") +
  ylim(0,15) +
  ggtitle("四省治愈率趋势图(湖北/浙江/河南/广东)") +
  scale_color_manual(values=mannul_colors) +
  scale_shape_manual(values=mannul_shapes) +
  theme(text = element_text(family = "Source Han Sans CN")) +
  theme(
    plot.title = title,
    axis.text=element_text(size=8,face="bold"),
    axis.title.x=title_x,
    axis.title.y=title_y,
    legend.text = element_text(size=8),
    legend.title = element_text(size=8),
    )
}

# Call the function to draw each chart
hubei <- plot_provinces("湖北省", FALSE, TRUE, FALSE)
henan <- plot_provinces("河南省", TRUE, FALSE, FALSE)
zhejiang <- plot_provinces("浙江省", TRUE, TRUE, TRUE)
guangdong <- plot_provinces("广东省", TRUE, FALSE, TRUE)

# Combine into one figure (multiplot() comes from Rmisc)
multiplot(hubei, henan, zhejiang, guangdong, cols=2)

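Analysis:
1. Among the four provinces with the most confirmed cases, all except Hubei have reached or exceeded a cure rate of 10%. More patients should recover as time goes on.
2. Zhejiang and Henan have the best cure rates, possibly thanks to proactive measures by their governments, which other provinces could learn from.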

Nationwide new confirmed cases trend chart

Data processing

# Python
# Filter the Wuhan data / rename the column
wuhan_data = cnov_data[cnov_data.cityName == "武汉"] >> rename(ticker = X.cityName)

# Filter the non-Hubei data / aggregate by date / add the ticker column
not_hubei_data = province_data[province_data.provinceName != "湖北省"].groupby(
"updateDate").sum(numeric_only=True).reset_index() >> mutate(ticker = "全国(不含湖北)")

# Filter the Hubei data / rename the column
hubei_data = province_data[province_data.provinceName == "湖北省"] >> rename(ticker = X.provinceName)

# Define a function that computes the daily increase (assumes rows are already sorted by updateDate)
def calculation_data(table):
  table = table >> mutate(yesterday_confirmed = X.confirmed.shift(1)) >> mutate(
new_confirmed = X.confirmed-X.yesterday_confirmed) >> tail(-1) >> select(X.ticker,X.updateDate,X.new_confirmed)
  return table

# Apply the function to compute the daily increase
hubei_data = calculation_data(hubei_data)
wuhan_data = calculation_data(wuhan_data)
not_hubei_data = calculation_data(not_hubei_data)

# Combine the tables
china_data = pd.concat([hubei_data, wuhan_data, not_hubei_data], join="inner").reset_index().drop(columns="index")
china_data.sample(3)

##       ticker  updateDate  new_confirmed
## 35  全国(不含湖北)  2020-01-28          525.0
## 13       湖北省  2020-02-07         2447.0
## 38  全国(不含湖北)  2020-01-31         1070.0
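The shift-and-subtract step inside `calculation_data` is what pandas' `diff()` does in one call. A minimal sketch, using the three Hubei cumulative counts printed in the cure-rate section and sorted by date first (`hubei` and `daily` are illustrative names):

```python
import pandas as pd

# Cumulative confirmed cases for Hubei, taken from the output shown earlier
hubei = pd.DataFrame({
    "updateDate": ["2020-02-07", "2020-02-08", "2020-02-09"],
    "confirmed": [22112, 24953, 27100],
}).sort_values("updateDate")

# diff() subtracts the previous row, i.e. confirmed - confirmed.shift(1)
hubei["new_confirmed"] = hubei["confirmed"].diff()

# The first day has no previous value, so drop it (like tail(-1) above)
daily = hubei.dropna(subset=["new_confirmed"])
print(daily)
```

Sorting by date before calling `diff()` matters: on unsorted rows the differences would mix up days.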

Visualization

# R
ggplot(py$china_data, aes(x=as.Date(updateDate, "%Y-%m-%d"), y=new_confirmed, color=ticker)) +
  geom_line(size=.8, alpha=.7) +
  geom_point(aes(shape=ticker)) +
  xlab("日期") +
  ylab("人数") +
  ggtitle("全国新增确诊趋势图") +
  geom_hline(yintercept = 1000, color='black', size=.8, alpha=.5, linetype="dotted") +
  geom_hline(yintercept = 2000, color='black', size=.8, alpha=.5, linetype="dotted") +
  geom_hline(yintercept = 3000, color='black', size=.8, alpha=.5, linetype="dotted") +
  theme_classic() +
  theme(text = element_text(family = "Source Han Sans CN")) +
  scale_x_date(date_labels = "%m/%d", breaks = "2 days")

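Analysis:
1. The chart shows clearly that outside Hubei the outbreak is being brought under control: new confirmed cases are falling steadily.
2. Wuhan and Hubei province face the most severe situation, with no sign of improvement yet.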