使用するパッケージ
* tidyverse・・・データフレームの操作
* glimpse() / filter() / select() / mutate()
データ把握にはglimpse()を使う
library("tidyverse")
df<-read_csv("R_Seminar_data.csv")
#glimpse
df %>%
glimpse()
## Observations: 500
## Variables: 17
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ Year <dbl> 2010, 2012, 2012, 2011, 2013, 2010, 2014, 2011, 2012,...
## $ Admday <chr> "2010/10/24", "2012/9/24", "2012/12/9", "2011/9/9", "...
## $ Discday <chr> "2010/11/5", "2012/10/3", "2012/12/14", "2011/9/19", ...
## $ New_Treatment <dbl> 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ Age <dbl> 62, 82, 75, 78, 78, 68, 72, 71, 80, 72, 75, 63, 77, 6...
## $ Sex <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1,...
## $ Height <dbl> 167, 156, 155, 153, 154, 157, 168, 154, 165, 152, 167...
## $ Weight <dbl> 75.8, 57.0, 61.2, 49.5, 52.5, 61.1, 64.3, 47.3, 61.9,...
## $ DM <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Stroke <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MI <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Severity <dbl> 2, 3, 2, 2, 8, 3, 1, 7, 11, 2, 6, 1, 2, 1, 1, 2, 1, 1...
## $ Death <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BMI2 <dbl> 27.17917, 23.42209, 25.47347, 21.14571, 22.13695, 24....
## $ BMI2_cat <chr> "Overweight", "High_normal", "Overweight", "Low_norma...
## $ LOS2 <dbl> 13, 10, 6, 11, 15, 9, 19, 22, 15, 10, 7, 10, 13, 6, 6...
左の結果を右に渡す演算子(ctrl(command)+shift+M)
ex: c(1,2,3,4) %>% mean()
c(1,2,3,4) %>% mean()
## [1] 2.5
複雑なスクリプトをわかりやすく書ける
余計なオブジェクトを作らなくて良い
ex: c(1,1,2,2,3) %>% mean() %>% round()
c(1,1,2,2,3) %>% mean() %>% round()
## [1] 2
filter(条件)
指定の条件で取り出す
ex: New_Treatmentを行った患者のみ
df %>% filter(New_Treatment==1)
## # A tibble: 179 x 17
## id Year Admday Discday New_Treatment Age Sex Height Weight DM
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 2012 2012/~ 2012/1~ 1 82 2 156 57 1
## 2 3 2012 2012/~ 2012/1~ 1 75 2 155 61.2 0
## 3 5 2013 2013/~ 2013/1~ 1 78 2 154 52.5 0
## 4 7 2014 2014/~ 2014/3~ 1 72 2 168 64.3 0
## 5 8 2011 2011/~ 2011/1~ 1 71 2 154 47.3 0
## 6 10 2014 2014/~ 2014/3~ 1 72 2 152 51.3 0
## 7 13 2012 2012/~ 2012/1~ 1 77 2 157 66.5 0
## 8 16 2012 2012/~ 2012/4~ 1 73 1 158 61.4 0
## 9 20 2013 2013/~ 2013/9~ 1 82 1 153 49.3 0
## 10 22 2011 2011/~ 2011/2~ 1 84 2 157 53.7 0
## # ... with 169 more rows, and 7 more variables: Stroke <dbl>, MI <dbl>,
## # Severity <dbl>, Death <dbl>, BMI2 <dbl>, BMI2_cat <chr>, LOS2 <dbl>
条件を指定する演算子
sign | condition |
---|---|
== | equal |
!= | not equal |
>= | equal and more |
<= | equal and less |
> | more |
< | less |
$ | AND |
| | OR |
select(変数名) 変数名で列を取り出す
df %>% select(id, Year, Age) #id, Year, Ageの列の取り出し
## # A tibble: 500 x 3
## id Year Age
## <dbl> <dbl> <dbl>
## 1 1 2010 62
## 2 2 2012 82
## 3 3 2012 75
## 4 4 2011 78
## 5 5 2013 78
## 6 6 2010 68
## 7 7 2014 72
## 8 8 2011 71
## 9 9 2012 80
## 10 10 2014 72
## # ... with 490 more rows
mutate(newname = formura)
#create BMI
df %>% mutate(BMI = Weight/Height/Height*10000) -> df # ->で最後に入れ込む
summary(df$BMI) # check the variale!
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.70 21.22 22.85 22.70 24.31 28.14
連続変数を指定した閾値でカテゴリー化する case_when(A ~ B, #条件AならばBというカテゴリー分類 C ~ D, E ~ F)
df %>% mutate(BMI_cat = case_when(BMI < 18.5 ~"underweight",
BMI < 22.5 ~ "Low_normal",
BMI < 25 ~ "High_normal",
BMI < 30 ~ "Overweight",
BMI > 30 ~ "Obesity")) -> df
table(df$BMI_cat)
##
## High_normal Low_normal Overweight underweight
## 209 195 76 20
Package “lubridate”を使う
admday, discdayは文字列型変数なので引き算できない
ymd()で日付型データに変換する)必要がある
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.5.3
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
class("2019/12/24") # characterになっている
## [1] "character"
class(ymd("2019/12/24")) #Dateに変換
## [1] "Date"
日付に変換
df %>% mutate(Admday = ymd(Admday), Discday = ymd (Discday)) -> df
#日付の引き算
df %>%mutate(LOS = as.integer(Discday - Admday)+1) -> df #regressionに使えないのでas.integerで整数に変えておく
summary(df$LOS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 9.00 12.00 13.37 16.00 57.00
level 1
dfの概要を一覧せよ
glimpse(df) # INFERENCEで推奨 glimpse()
## Observations: 500
## Variables: 20
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ Year <dbl> 2010, 2012, 2012, 2011, 2013, 2010, 2014, 2011, 2012,...
## $ Admday <date> 2010-10-24, 2012-09-24, 2012-12-09, 2011-09-09, 2013...
## $ Discday <date> 2010-11-05, 2012-10-03, 2012-12-14, 2011-09-19, 2013...
## $ New_Treatment <dbl> 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ Age <dbl> 62, 82, 75, 78, 78, 68, 72, 71, 80, 72, 75, 63, 77, 6...
## $ Sex <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1,...
## $ Height <dbl> 167, 156, 155, 153, 154, 157, 168, 154, 165, 152, 167...
## $ Weight <dbl> 75.8, 57.0, 61.2, 49.5, 52.5, 61.1, 64.3, 47.3, 61.9,...
## $ DM <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Stroke <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MI <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Severity <dbl> 2, 3, 2, 2, 8, 3, 1, 7, 11, 2, 6, 1, 2, 1, 1, 2, 1, 1...
## $ Death <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BMI2 <dbl> 27.17917, 23.42209, 25.47347, 21.14571, 22.13695, 24....
## $ BMI2_cat <chr> "Overweight", "High_normal", "Overweight", "Low_norma...
## $ LOS2 <dbl> 13, 10, 6, 11, 15, 9, 19, 22, 15, 10, 7, 10, 13, 6, 6...
## $ BMI <dbl> 27.17917, 23.42209, 25.47347, 21.14571, 22.13695, 24....
## $ BMI_cat <chr> "Overweight", "High_normal", "Overweight", "Low_norma...
## $ LOS <dbl> 13, 10, 6, 11, 15, 9, 19, 22, 15, 10, 7, 10, 13, 6, 6...
head(df) #簡単にhead()
## # A tibble: 6 x 20
## id Year Admday Discday New_Treatment Age Sex Height Weight
## <dbl> <dbl> <date> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2010 2010-10-24 2010-11-05 0 62 2 167 75.8
## 2 2 2012 2012-09-24 2012-10-03 1 82 2 156 57
## 3 3 2012 2012-12-09 2012-12-14 1 75 2 155 61.2
## 4 4 2011 2011-09-09 2011-09-19 0 78 2 153 49.5
## 5 5 2013 2013-01-12 2013-01-26 1 78 2 154 52.5
## 6 6 2010 2010-08-06 2010-08-14 0 68 1 157 61.1
## # ... with 11 more variables: DM <dbl>, Stroke <dbl>, MI <dbl>, Severity <dbl>,
## # Death <dbl>, BMI2 <dbl>, BMI2_cat <chr>, LOS2 <dbl>, BMI <dbl>,
## # BMI_cat <chr>, LOS <dbl>
str(df) #詳しくstr()
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 500 obs. of 20 variables:
## $ id : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : num 2010 2012 2012 2011 2013 ...
## $ Admday : Date, format: "2010-10-24" "2012-09-24" ...
## $ Discday : Date, format: "2010-11-05" "2012-10-03" ...
## $ New_Treatment: num 0 1 1 0 1 0 1 1 0 1 ...
## $ Age : num 62 82 75 78 78 68 72 71 80 72 ...
## $ Sex : num 2 2 2 2 2 1 2 2 2 2 ...
## $ Height : num 167 156 155 153 154 157 168 154 165 152 ...
## $ Weight : num 75.8 57 61.2 49.5 52.5 61.1 64.3 47.3 61.9 51.3 ...
## $ DM : num 0 1 0 0 0 0 0 0 1 0 ...
## $ Stroke : num 0 0 0 0 0 0 0 0 0 0 ...
## $ MI : num 0 0 0 1 0 1 0 0 0 1 ...
## $ Severity : num 2 3 2 2 8 3 1 7 11 2 ...
## $ Death : num 0 0 0 0 0 0 0 0 1 0 ...
## $ BMI2 : num 27.2 23.4 25.5 21.1 22.1 ...
## $ BMI2_cat : chr "Overweight" "High_normal" "Overweight" "Low_normal" ...
## $ LOS2 : num 13 10 6 11 15 9 19 22 15 10 ...
## $ BMI : num 27.2 23.4 25.5 21.1 22.1 ...
## $ BMI_cat : chr "Overweight" "High_normal" "Overweight" "Low_normal" ...
## $ LOS : num 13 10 6 11 15 9 19 22 15 10 ...
DM患者のみをdf_dmに格納せよ
df %>% filter(DM==1) -> df_dm
Level 2
年齢を<=65, 65-75, >75にカテゴリー化 せよ
df %>% mutate(age_cat= case_when(Age <=65 ~ ">=65",
Age <=75 ~ ">65, <=75",
Age >75 ~ ">75"))->df
table(df$age_cat)
##
## >=65 >65, <=75 >75
## 27 241 232
dfから男性のid, sex, LOSを取り出し,df_maleに格納
df_male<-df %>% filter(Sex==1) %>% select("id","LOS")
head(df_male)
## # A tibble: 6 x 2
## id LOS
## <dbl> <dbl>
## 1 6 9
## 2 16 17
## 3 18 6
## 4 20 13
## 5 24 45
## 6 28 13
summary(df_male)
## id LOS
## Min. : 6.0 Min. : 5.00
## 1st Qu.:149.8 1st Qu.: 9.00
## Median :267.0 Median :12.50
## Mean :262.6 Mean :13.87
## 3rd Qu.:388.2 3rd Qu.:17.00
## Max. :498.0 Max. :48.00