データの整形

データフレームの整形

使用するパッケージ
* tidyverse・・・データフレームの操作
* glimpse() / filter() / select() / mutate()

  • lubridate・・・日付の取扱
    • ymd()

dataの把握  

データ把握にはglimpse()を使う

  • 変数一覧
  • 変数の数(=列数)
  • サンプル数 (=行数)
  • 各変数の型
library("tidyverse")
df<-read_csv("R_Seminar_data.csv")
#glimpse
df %>% 
  glimpse()
## Observations: 500
## Variables: 17
## $ id            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ Year          <dbl> 2010, 2012, 2012, 2011, 2013, 2010, 2014, 2011, 2012,...
## $ Admday        <chr> "2010/10/24", "2012/9/24", "2012/12/9", "2011/9/9", "...
## $ Discday       <chr> "2010/11/5", "2012/10/3", "2012/12/14", "2011/9/19", ...
## $ New_Treatment <dbl> 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ Age           <dbl> 62, 82, 75, 78, 78, 68, 72, 71, 80, 72, 75, 63, 77, 6...
## $ Sex           <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1,...
## $ Height        <dbl> 167, 156, 155, 153, 154, 157, 168, 154, 165, 152, 167...
## $ Weight        <dbl> 75.8, 57.0, 61.2, 49.5, 52.5, 61.1, 64.3, 47.3, 61.9,...
## $ DM            <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Stroke        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MI            <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Severity      <dbl> 2, 3, 2, 2, 8, 3, 1, 7, 11, 2, 6, 1, 2, 1, 1, 2, 1, 1...
## $ Death         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BMI2          <dbl> 27.17917, 23.42209, 25.47347, 21.14571, 22.13695, 24....
## $ BMI2_cat      <chr> "Overweight", "High_normal", "Overweight", "Low_norma...
## $ LOS2          <dbl> 13, 10, 6, 11, 15, 9, 19, 22, 15, 10, 7, 10, 13, 6, 6...

パイプ(%>%)

左の結果を右に渡す演算子(ctrl(command)+shift+M)
ex: c(1,2,3,4) %>% mean()

c(1,2,3,4) %>% mean()  
## [1] 2.5

複雑なスクリプトをわかりやすく書ける
余計なオブジェクトを作らなくて良い

ex: c(1,1,2,2,3) %>% mean() %>% round()

c(1,1,2,2,3) %>% mean() %>% round()
## [1] 2

filter()

filter(条件)
指定の条件で取り出す
ex: New_Treatmentを行った患者のみ

df %>% filter(New_Treatment==1)
## # A tibble: 179 x 17
##       id  Year Admday Discday New_Treatment   Age   Sex Height Weight    DM
##    <dbl> <dbl> <chr>  <chr>           <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
##  1     2  2012 2012/~ 2012/1~             1    82     2    156   57       1
##  2     3  2012 2012/~ 2012/1~             1    75     2    155   61.2     0
##  3     5  2013 2013/~ 2013/1~             1    78     2    154   52.5     0
##  4     7  2014 2014/~ 2014/3~             1    72     2    168   64.3     0
##  5     8  2011 2011/~ 2011/1~             1    71     2    154   47.3     0
##  6    10  2014 2014/~ 2014/3~             1    72     2    152   51.3     0
##  7    13  2012 2012/~ 2012/1~             1    77     2    157   66.5     0
##  8    16  2012 2012/~ 2012/4~             1    73     1    158   61.4     0
##  9    20  2013 2013/~ 2013/9~             1    82     1    153   49.3     0
## 10    22  2011 2011/~ 2011/2~             1    84     2    157   53.7     0
## # ... with 169 more rows, and 7 more variables: Stroke <dbl>, MI <dbl>,
## #   Severity <dbl>, Death <dbl>, BMI2 <dbl>, BMI2_cat <chr>, LOS2 <dbl>

条件を指定する演算子

sign condition
== equal
!= not equal
>= equal and more
<= equal and less
> more
< less
$ AND
OR

select()

select(変数名) 変数名で列を取り出す

df %>% select(id, Year, Age) #id, Year, Ageの列の取り出し
## # A tibble: 500 x 3
##       id  Year   Age
##    <dbl> <dbl> <dbl>
##  1     1  2010    62
##  2     2  2012    82
##  3     3  2012    75
##  4     4  2011    78
##  5     5  2013    78
##  6     6  2010    68
##  7     7  2014    72
##  8     8  2011    71
##  9     9  2012    80
## 10    10  2014    72
## # ... with 490 more rows

新しい変数の作成

mutate

mutate(newname = formura)

#create BMI
df %>% mutate(BMI = Weight/Height/Height*10000) -> df # ->で最後に入れ込む
summary(df$BMI) # check the variale!
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.70   21.22   22.85   22.70   24.31   28.14

case_when

連続変数を指定した閾値でカテゴリー化する case_when(A ~ B, #条件AならばBというカテゴリー分類 C ~ D, E ~ F)

df %>% mutate(BMI_cat = case_when(BMI < 18.5 ~"underweight",
                                  BMI < 22.5 ~ "Low_normal",
                                  BMI < 25 ~ "High_normal",
                                  BMI < 30 ~ "Overweight",
                                  BMI > 30 ~ "Obesity")) -> df
table(df$BMI_cat)
## 
## High_normal  Low_normal  Overweight underweight 
##         209         195          76          20

入院期間の作成

Package “lubridate”を使う
admday, discdayは文字列型変数なので引き算できない
ymd()で日付型データに変換する)必要がある

library(lubridate)
## Warning: package 'lubridate' was built under R version 3.5.3
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
class("2019/12/24") # characterになっている                     
## [1] "character"
class(ymd("2019/12/24")) #Dateに変換
## [1] "Date"

入院期間の作成

日付に変換

df %>% mutate(Admday = ymd(Admday), Discday = ymd (Discday)) -> df

#日付の引き算
df %>%mutate(LOS = as.integer(Discday - Admday)+1) -> df #regressionに使えないのでas.integerで整数に変えておく
summary(df$LOS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00    9.00   12.00   13.37   16.00   57.00

Exercise2

level 1
dfの概要を一覧せよ

glimpse(df) # INFERENCEで推奨 glimpse()
## Observations: 500
## Variables: 20
## $ id            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ Year          <dbl> 2010, 2012, 2012, 2011, 2013, 2010, 2014, 2011, 2012,...
## $ Admday        <date> 2010-10-24, 2012-09-24, 2012-12-09, 2011-09-09, 2013...
## $ Discday       <date> 2010-11-05, 2012-10-03, 2012-12-14, 2011-09-19, 2013...
## $ New_Treatment <dbl> 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ Age           <dbl> 62, 82, 75, 78, 78, 68, 72, 71, 80, 72, 75, 63, 77, 6...
## $ Sex           <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1,...
## $ Height        <dbl> 167, 156, 155, 153, 154, 157, 168, 154, 165, 152, 167...
## $ Weight        <dbl> 75.8, 57.0, 61.2, 49.5, 52.5, 61.1, 64.3, 47.3, 61.9,...
## $ DM            <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Stroke        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MI            <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Severity      <dbl> 2, 3, 2, 2, 8, 3, 1, 7, 11, 2, 6, 1, 2, 1, 1, 2, 1, 1...
## $ Death         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BMI2          <dbl> 27.17917, 23.42209, 25.47347, 21.14571, 22.13695, 24....
## $ BMI2_cat      <chr> "Overweight", "High_normal", "Overweight", "Low_norma...
## $ LOS2          <dbl> 13, 10, 6, 11, 15, 9, 19, 22, 15, 10, 7, 10, 13, 6, 6...
## $ BMI           <dbl> 27.17917, 23.42209, 25.47347, 21.14571, 22.13695, 24....
## $ BMI_cat       <chr> "Overweight", "High_normal", "Overweight", "Low_norma...
## $ LOS           <dbl> 13, 10, 6, 11, 15, 9, 19, 22, 15, 10, 7, 10, 13, 6, 6...
head(df) #簡単にhead()
## # A tibble: 6 x 20
##      id  Year Admday     Discday    New_Treatment   Age   Sex Height Weight
##   <dbl> <dbl> <date>     <date>             <dbl> <dbl> <dbl>  <dbl>  <dbl>
## 1     1  2010 2010-10-24 2010-11-05             0    62     2    167   75.8
## 2     2  2012 2012-09-24 2012-10-03             1    82     2    156   57  
## 3     3  2012 2012-12-09 2012-12-14             1    75     2    155   61.2
## 4     4  2011 2011-09-09 2011-09-19             0    78     2    153   49.5
## 5     5  2013 2013-01-12 2013-01-26             1    78     2    154   52.5
## 6     6  2010 2010-08-06 2010-08-14             0    68     1    157   61.1
## # ... with 11 more variables: DM <dbl>, Stroke <dbl>, MI <dbl>, Severity <dbl>,
## #   Death <dbl>, BMI2 <dbl>, BMI2_cat <chr>, LOS2 <dbl>, BMI <dbl>,
## #   BMI_cat <chr>, LOS <dbl>
str(df) #詳しくstr()
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 500 obs. of  20 variables:
##  $ id           : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Year         : num  2010 2012 2012 2011 2013 ...
##  $ Admday       : Date, format: "2010-10-24" "2012-09-24" ...
##  $ Discday      : Date, format: "2010-11-05" "2012-10-03" ...
##  $ New_Treatment: num  0 1 1 0 1 0 1 1 0 1 ...
##  $ Age          : num  62 82 75 78 78 68 72 71 80 72 ...
##  $ Sex          : num  2 2 2 2 2 1 2 2 2 2 ...
##  $ Height       : num  167 156 155 153 154 157 168 154 165 152 ...
##  $ Weight       : num  75.8 57 61.2 49.5 52.5 61.1 64.3 47.3 61.9 51.3 ...
##  $ DM           : num  0 1 0 0 0 0 0 0 1 0 ...
##  $ Stroke       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ MI           : num  0 0 0 1 0 1 0 0 0 1 ...
##  $ Severity     : num  2 3 2 2 8 3 1 7 11 2 ...
##  $ Death        : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ BMI2         : num  27.2 23.4 25.5 21.1 22.1 ...
##  $ BMI2_cat     : chr  "Overweight" "High_normal" "Overweight" "Low_normal" ...
##  $ LOS2         : num  13 10 6 11 15 9 19 22 15 10 ...
##  $ BMI          : num  27.2 23.4 25.5 21.1 22.1 ...
##  $ BMI_cat      : chr  "Overweight" "High_normal" "Overweight" "Low_normal" ...
##  $ LOS          : num  13 10 6 11 15 9 19 22 15 10 ...

DM患者のみをdf_dmに格納せよ

df %>% filter(DM==1) -> df_dm

Level 2
年齢を<=65, 65-75, >75にカテゴリー化 せよ

df %>% mutate(age_cat= case_when(Age <=65 ~ ">=65",
              Age <=75 ~ ">65, <=75",
              Age >75 ~ ">75"))->df
table(df$age_cat)
## 
##      >=65 >65, <=75       >75 
##        27       241       232

dfから男性のid, sex, LOSを取り出し,df_maleに格納

df_male<-df %>% filter(Sex==1) %>% select("id","LOS")
head(df_male)
## # A tibble: 6 x 2
##      id   LOS
##   <dbl> <dbl>
## 1     6     9
## 2    16    17
## 3    18     6
## 4    20    13
## 5    24    45
## 6    28    13
summary(df_male)
##        id             LOS       
##  Min.   :  6.0   Min.   : 5.00  
##  1st Qu.:149.8   1st Qu.: 9.00  
##  Median :267.0   Median :12.50  
##  Mean   :262.6   Mean   :13.87  
##  3rd Qu.:388.2   3rd Qu.:17.00  
##  Max.   :498.0   Max.   :48.00