This pet project started from a desire to track what my daughter’s doctors were prescribing her. My daughter Hayoon spent most of her days in daycare and would often end up catching various viruses or developing skin issues. Sometimes she needed to visit the doctor once or twice each week. In Korea, it is incredibly easy to and convenient to visit any doctor or clinic of your choice, but there is never enough time to deeply consult with medical professionals concerning the health issues that brought you in for their care. Each time we left the doctor’s office with a new prescription and I began to wonder exactly what kind of medicine was being prescribed. Even though I asked the doctors to share my daughter’s EMR data with me, I could not get them to share a digital version, so I needed to collect all of her paper prescriptions and type them into a spreadsheet by hand.
I collected all medication prescriptions and analyzed the data to see what the relation was between disease codes and medications prescribed. I first collected the data by hand then analyzed and visualized them with R programming.
The project led me through the whole process of data analysis from collecting raw data, processing the data for analyzing, actual analysis and visualization. It encouraged me to join an R community and study statistics and R programming language.
Also, this personal project revealed to me the issue of health autonomy and medical data ownership when I was not allowed to access to my own daughter’s medical data. It showed me the true value of data, and led me to believe the individual should have some rights over it. It beckons me to learn more about the data market and data technology of the healthcare industry. It even led me to learn more about medical data saving controversies that erupted in Korea and about an uptick in medicinal overdoses due to “medical shopping”.
Whenever one gets a prescription in Korea, they are always presented with two copies. One is for the patient and the other is for the drug store. I collected almost a year’s worth of paper prescription from my daughter’s doctor visits and transcribed the data into a spreadsheet.
I used ten variables for column headers, with key variables being “disease code” and “medicine name” to verify the existence of a correlation.
Making a chart requires knowledge of statistics, and making them attractive requires visual literacy and graphic skills. I chose R programming language so that I could use a diversity of charts to make beautiful visualizations. Using graphic tools to create charts is both slow and imprecise. By using R, I could become more familiar with statistics and programming at the same time.
Before compiling the data, I had no idea what to ask or suspect. While wrangling the data, questions began to arise like why a certain diagnosis occurred so frequently, or the different prescribing habits of each doctor. In making these charts, not only were trends revealed, but it also brought new thoughts and questions to mind, deepening my desire to learn more.
library(readxl)
Warning: package 'readxl' was built under R version 3.5.2
library(tidyverse)
─ Attaching packages ────────────────────────────────────── tidyverse 1.2.1 ─
✔ ggplot2 3.2.1 ✔ purrr 0.2.5
✔ tibble 2.1.3 ✔ dplyr 0.8.3
✔ tidyr 0.8.1 ✔ stringr 1.3.1
✔ readr 1.1.1 ✔ forcats 0.3.0
Warning: package 'ggplot2' was built under R version 3.5.2
Warning: package 'tibble' was built under R version 3.5.2
Warning: package 'dplyr' was built under R version 3.5.2
─ Conflicts ──────────────────────────────────────── tidyverse_conflicts() ─
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
par(family="AppleGothic")
hayoon<-read_excel("hayoon_hist.xlsx")
New names:
* `` -> ...13
* `` -> ...14
* `` -> ...15
* `` -> ...16
* `` -> ...17
* … and 3 more problems
str(hayoon)
Classes 'tbl_df', 'tbl' and 'data.frame': 156 obs. of 20 variables:
$ date : chr "20190912" "20190912" "20190912" "20190912" ...
$ persc : chr "04854호" "04854호" "04854호" "04854호" ...
$ hosp : chr "아주맑은소아청소년과의원" "아주맑은소아청소년과의원" "아주맑은소아청소년과의원" "아주맑은소아청소년과의원" ...
$ disease : chr "J303, R093" "J303, R093" "J303, R093" "J303, R093" ...
$ doc : chr "김병기" "김병기" "김병기" "김병기" ...
$ medic : num 6.49e+08 6.44e+08 6.46e+08 6.46e+08 6.44e+08 ...
$ medic_name : chr "디푸루칸건조시럽" "레브로콜정" "코비안에스시럽" "씨투스건조시럽" ...
$ once : chr "4" "0.5" "4" "0.33329999999999999" ...
$ one_day : num 1 3 3 3 3 3 3 3 3 3 ...
$ total_count: num 4 4 4 4 4 3 5 5 5 5 ...
$ no_ins : chr NA NA NA NA ...
$ double : chr NA NA NA NA ...
$ ...13 : logi NA NA NA NA NA NA ...
$ ...14 : logi NA NA NA NA NA NA ...
$ ...15 : logi NA NA NA NA NA NA ...
$ ...16 : logi NA NA NA NA NA NA ...
$ ...17 : logi NA NA NA NA NA NA ...
$ ...18 : logi NA NA NA NA NA NA ...
$ ...19 : logi NA NA NA NA NA NA ...
$ ...20 : logi NA NA NA NA NA NA ...
#단순 통계
names(hayoon) #칼럼명만 보기
[1] "date" "persc" "hosp" "disease" "doc"
[6] "medic" "medic_name" "once" "one_day" "total_count"
[11] "no_ins" "double" "...13" "...14" "...15"
[16] "...16" "...17" "...18" "...19" "...20"
dim(hayoon) # 행렬 개수
[1] 156 20
# group_by()와 같은 결과, excel의 countif()함수
table(hayoon$disease)
H6501, J209 H6501, J459 H6501, L2088 H6641, J209 H6641, J303
6 11 7 5 22
J180, J459 J209, J060 J209, J303 J209, J459 J303, R093
6 10 9 6 12
J459, J459 L209, J209 L509, L039 R113, A090
33 17 6 4
library(stringr) #문자열 패턴 찾기 패키지
# date 란 단어가 있는 칼럼 모두 찾기
date_cols <- str_detect(names(hayoon), "date")
date_cols
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
lubridate: charactor를 날짜로 변환하는 패키지 ymd: year-month-date롤 변환
library(lubridate) #chr를 날짜로 변환하는 패키지
Attaching package: 'lubridate'
The following object is masked from 'package:base':
date
# 하윤 데이터 칼럼에 date란 단어가 든 칼럼 날짜로 변환하기
hayoon[ ,date_cols] <- lapply(hayoon[ ,date_cols], ymd) #lappy(데이터셋, 함수)
Warning: 2 failed to parse.
hayoon[ ,date_cols]
# A tibble: 156 x 1
date
<date>
1 2019-09-12
2 2019-09-12
3 2019-09-12
4 2019-09-12
5 2019-09-12
6 2019-09-10
7 2019-09-10
8 2019-09-10
9 2019-09-10
10 2019-09-10
# … with 146 more rows
print(hayoon[ ,date_cols], n=5)
# A tibble: 156 x 1
date
<date>
1 2019-09-12
2 2019-09-12
3 2019-09-12
4 2019-09-12
5 2019-09-12
# … with 151 more rows
str(hayoon)
Classes 'tbl_df', 'tbl' and 'data.frame': 156 obs. of 20 variables:
$ date : Date, format: "2019-09-12" "2019-09-12" ...
$ persc : chr "04854호" "04854호" "04854호" "04854호" ...
$ hosp : chr "아주맑은소아청소년과의원" "아주맑은소아청소년과의원" "아주맑은소아청소년과의원" "아주맑은소아청소년과의원" ...
$ disease : chr "J303, R093" "J303, R093" "J303, R093" "J303, R093" ...
$ doc : chr "김병기" "김병기" "김병기" "김병기" ...
$ medic : num 6.49e+08 6.44e+08 6.46e+08 6.46e+08 6.44e+08 ...
$ medic_name : chr "디푸루칸건조시럽" "레브로콜정" "코비안에스시럽" "씨투스건조시럽" ...
$ once : chr "4" "0.5" "4" "0.33329999999999999" ...
$ one_day : num 1 3 3 3 3 3 3 3 3 3 ...
$ total_count: num 4 4 4 4 4 3 5 5 5 5 ...
$ no_ins : chr NA NA NA NA ...
$ double : chr NA NA NA NA ...
$ ...13 : logi NA NA NA NA NA NA ...
$ ...14 : logi NA NA NA NA NA NA ...
$ ...15 : logi NA NA NA NA NA NA ...
$ ...16 : logi NA NA NA NA NA NA ...
$ ...17 : logi NA NA NA NA NA NA ...
$ ...18 : logi NA NA NA NA NA NA ...
$ ...19 : logi NA NA NA NA NA NA ...
$ ...20 : logi NA NA NA NA NA NA ...
summary(hayoon)
date persc hosp
Min. :2018-12-10 Length:156 Length:156
1st Qu.:2019-04-02 Class :character Class :character
Median :2019-06-01 Mode :character Mode :character
Mean :2019-06-04
3rd Qu.:2019-08-19
Max. :2019-12-26
NA's :2
disease doc medic
Length:156 Length:156 Min. : 64350611
Class :character Class :character 1st Qu.:643504180
Mode :character Mode :character Median :643506111
Mean :641438965
3rd Qu.:645700935
Max. :671803433
NA's :2
medic_name once one_day total_count
Length:156 Length:156 Min. :1.000 Min. :1.000
Class :character Class :character 1st Qu.:3.000 1st Qu.:3.000
Mode :character Mode :character Median :3.000 Median :3.000
Mean :2.701 Mean :3.131
3rd Qu.:3.000 3rd Qu.:3.000
Max. :3.000 Max. :5.000
NA's :2 NA's :3
no_ins double ...13 ...14
Length:156 Length:156 Mode:logical Mode:logical
Class :character Class :character NA's:156 NA's:156
Mode :character Mode :character
...15 ...16 ...17 ...18
Mode:logical Mode:logical Mode:logical Mode:logical
NA's:156 NA's:156 NA's:156 NA's:156
...19 ...20
Mode:logical Mode:logical
NA's:156 NA's:156
# missing value 찾기
missing <- lapply(hayoon[,date_cols], is.na)
missing
$date
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[155] TRUE TRUE
num_missing <- sapply(missing, sum)
num_missing
date
2
# 불필요한 칼럼 제거
remove <- c(13:20)
# 불필요한 행(row) 제거
remove_row<-c(155:156)
cleanHayoonL<-hayoon[-remove_row, -remove]
str(cleanHayoonL)
Classes 'tbl_df', 'tbl' and 'data.frame': 154 obs. of 12 variables:
$ date : Date, format: "2019-09-12" "2019-09-12" ...
$ persc : chr "04854호" "04854호" "04854호" "04854호" ...
$ hosp : chr "아주맑은소아청소년과의원" "아주맑은소아청소년과의원" "아주맑은소아청소년과의원" "아주맑은소아청소년과의원" ...
$ disease : chr "J303, R093" "J303, R093" "J303, R093" "J303, R093" ...
$ doc : chr "김병기" "김병기" "김병기" "김병기" ...
$ medic : num 6.49e+08 6.44e+08 6.46e+08 6.46e+08 6.44e+08 ...
$ medic_name : chr "디푸루칸건조시럽" "레브로콜정" "코비안에스시럽" "씨투스건조시럽" ...
$ once : chr "4" "0.5" "4" "0.33329999999999999" ...
$ one_day : num 1 3 3 3 3 3 3 3 3 3 ...
$ total_count: num 4 4 4 4 4 3 5 5 5 5 ...
$ no_ins : chr NA NA NA NA ...
$ double : chr NA NA NA NA ...
dim(cleanHayoonL)
[1] 154 12
#질병분류기호 factor로 만들기
cleanHayoonL$medic<-as.factor(cleanHayoonL$medic)
#질병분류기호 빈도 보기
medic_freq<-table(cleanHayoonL$medic)
theme_ft <- theme_light()+
theme(axis.title = element_text(size=8, face="bold"),
axis.text.x = element_text(size=6, angle=90),
axis.text.y = element_text(size=6),
plot.title = element_text(size=10, color="#333333", face="bold.italic"))
Disease codes are selected by the doctor to prescribe medicine. In the patient prescription copy, there were two major disease codes. To find out what disease she had most often in a year, I made a stack bar chart: the x axis is disease code and the y axis is the number of times it appears. Disease codes are the combination of alphabet and numbers, I factorized them to make a bar chart. The graph shows how often each disease code occured.
질병분류기호의 칼럼명은 disease이다. 막대챠트를 그리기 위해서는 카테고리 데이터가 필요하므로 factor로 변환시켜 주어 x좌표에 위치시킨다. 의약품의 칼럼명은 medic이다. 막대그래프안에 의약품별로 분류하기 위해 fill함수로 추가한다. 챠트를 보면 핑크색, 초록색 의약품이 전반적으로 씌여있다. 핑크색이 씌여지지 않은 막대 4, 5번이 눈에 띈다.
ggplot(cleanHayoonL, aes(x=factor(disease), fill=medic))+
geom_bar()+
labs(x='Disease codes',
y='Count of prescription',
title='Which diseases were diagnosed?',
fill ='Drug codes')+
theme_ft+
scale_fill_hue(h=c(0,300))
The positions were adjusted by dodging overlaps to the side. This makes it possible to more accurately know the type and dosage of drug prescribed by disease code classification.
geom_bar함수 안에 position=“dodge”로 설정하면 factor 요소가 분리되어 표현된다. 질병분류별 처방된 약의 종류와 용량을 더 정확하게 알 수 있다. geom_bar에 아무런 인자를 넣지 않으면 position=“stack”이 디폴트가 된다. 상단 챠트와 같이 그려진다.
ggplot(cleanHayoonL, aes(x=factor(disease), fill=medic))+geom_bar(position="dodge")+
labs(x='Disease codes',
y='Count of prescription',
title='Which medications were prescribed for each disease code?',
fill ='Drug codes')+
theme_light()+
theme_ft+
scale_fill_hue(h=c(0,300))
하지만 약의 종류가 많아서 제대로 알아 볼 수가 없다. 어떻게 해야 할까?
The proper quantity could not be directly compared between different medications, so I utilized a ratio comparison. Also, there were 32 different medications; too many to simply convey information by assigning each medication a distinct color. An overload of colors would not help to differentiate between each medication. Using only one color was easier on the eyes, and I was able to differentiate between each one by adjusting the opacity. I used the color palette package to achieve this effect.
의약품 개수가 많으므로 색을 조정해 보기로 했다. 색 팔레트 패키지 설치 install.packages(“RColorBrewer”) position=“fill”로 하여 막대의 길이를 동일하게 해서 각 처방날 마다 약의 비율만 비교해 보기로 했다.
library(RColorBrewer)
display.brewer.all(n=10, exact.n=FALSE)
# 색깔 팔레트 선택해서 넣기
blues<-brewer.pal(9,"Blues")
blue_range<-colorRampPalette(blues)
ggplot(cleanHayoonL, aes(x=factor(disease), fill=medic))+
geom_bar(position="fill")+
labs(x='Disease codes',
y='Count of prescription',
title='Which medications were prescribed the most?',
fill = 'Drug Codes')+
theme_ft+
scale_fill_manual(values=blue_range(32)) #32개 의약품 색처리
다른 색으로도 챠트를 그려보자.
RColorBrewer패키지에서 색 셋트를 불러오는 함수는 brewer.pal(n, name)이다. n: 사용하려는 색의 개수, name: 팔레트 이름
#팔레트 "PuOr" 셋트를 선택하고, 그 중에 10개의 색을 기본으로 선택한다.
cols<-brewer.pal(10, "PuOr")
#선택한 색을 colorRampPalette의 함수로 interpolate 하도록 만든다.
# colorRampPalette함수는 가져온 색 셋트에서 몇 개의 색을 뿌려줄지 선택하면, 그 사이의 색을 섞어서 보여준다.
pal_m<-colorRampPalette(cols)
ggplot(cleanHayoonL, aes(x=factor(disease), fill=medic))+
geom_bar(position="fill")+
labs(x='Disease codes',
y='Count of prescription',
title='Which medications were prescribed the most?',
fill = 'Drug Codes')+
scale_fill_manual(values=pal_m(32)) + #선택한 팔렛트에서 32개의 색을 사용한다고 선택한다.
theme_ft
I decided to sort each medication by disease classification code. Each box represents a disease code combination by doctor per visit. Multi-panel plots looked better than one bar chart to compare disease codes.
질병분류코드별로 의약품을 나누어 보기로 했다. 각 박스는 질병분류코드를 의미한다. x=처방받은 날짜, y=한번에 먹어야 하는 약의 용량, 색깔코드=의약품코드 질병코드별로 처방된 날짜와 의약품 보기가 훨씬 쉬워졌다.
class(cleanHayoonL$once)
[1] "character"
once_n<-as.numeric(cleanHayoonL$once)
Warning: 강제형변환에 의해 생성된 NA 입니다
scaleFUN <- function(x) sprintf("%.2f", x)
# 질병코드별 facet 그리드
ggplot(cleanHayoonL, aes(date, once_n, col=medic))+
geom_jitter(alpha=0.5, size=3, width=0.2)+
facet_wrap(~disease)+
labs(x='Dates',
y='One dose amount',
title='What kind of medicine were prescriped per disease codes?',
color = 'Drug Codes')+
theme_ft+
scale_y_continuous(labels=scaleFUN)
Warning: Removed 1 rows containing missing values (geom_point).
I decided to sort each medication by disease classification code. Each box represents a disease code combination by doctor per visit. Multi-panel plots looked better than one bar chart to compare disease codes.
의사별로 어떤 질병코드로 처방했을까?
# 의사별 분류
ggplot(cleanHayoonL, aes(date, once_n, col=doc))+
geom_point(alpha=0.5, size=3, width=0.2)+
facet_wrap(~disease)+
labs(x='Dates',
y='One dose amount',
title='Which disease codes were prescribed by doctors?',
color = 'Doctor name')+
theme_ft+
scale_y_continuous(labels=scaleFUN)+
theme(legend.text = element_text(family="AppleGothic"))
Warning: Ignoring unknown parameters: width
Warning: Removed 1 rows containing missing values (geom_point).
Disease codes are like blocks. The blocks can be combined with different codes for each doctor encounter. The graph below shows each code combination. For example, the code H6501 was diagnosed with J209, J459, and L2088 on different encounters. The density of color represents the number of medication administrations per day. The code H6501 represents otitis media and J209 is bronchitis. The graph shows that bronchitis was diagnosed several times that otitis media was also diagnosed.
좀 더 쉽게 볼 수 있는 챠트는 없을까?
x축, y축: 질병분류코드 색: 1일 투여횟수 (모든 의약품)
하루 섭취량이 가장 높은게 진한 색이어야 하는데, 반대로 되어있다.
#히트맵
cleanHayoonS<-separate(data=cleanHayoonL, col = disease, sep=", ", into = c("disease_1", "disease_2"))
ggplot(cleanHayoonS, aes(x=disease_1, disease_2, fill=one_day))+
geom_tile()+
labs(x="Disease Code 1", y="Disease Code 2", fill="Number of doses per day",
title="Which disease has the most frequency to take per day?")+
theme_ft
낮은 수는 연한 하늘색, 높은 수는 진한 하늘색으로 gradient를 준다.
ggplot(cleanHayoonS, aes(x=factor(disease_1), y=factor(disease_2), fill=one_day))+
geom_tile()+
labs(x="Disease Code 1", y="Disease Code 2", fill="Number of doses per day",
title="Which disease has the most frequency to take per day?")+
theme_ft+
scale_fill_gradient(low="#bfddff", high="#002e63")
그래프 x축 첫번째 값인 H6501 질병코드는 J209, J459, L2088과 조합해서 처방된 적이 있다는 것을 알 수 있다.
혹여나 해서 다시 필터링해 보았다. H6501 은 급성 장액성 중이염 (양쪽 귀)를 의미하는 질병코드이다. 주요질병 중이염이 걸렸을 때 J209: 상세불명의 기관지염, J459: 상세불명의 천식, L2088: 기타 아토피성 피부염을 부질병으로 처방했다는 뜻이다. 색의 진하기를 의미하는 바는 하루의 1일 투여횟수를 의미한다. 진한 파랑색은 1일 투여횟수가 많다는 의미다. 1일 투여되는 약은 처방된 모든 의약품이다. 즉, 진한 파랑색으로 유추할 수 있는 것은
히트맵은 위와 같은 사항으로 정확한 사실을 전달해 주는 것이 어렵게 느껴진다. 다른 그래프를 그려봐야 겠다.
cleanHayoonS %>%
filter(disease_1 == "H6501") %>%
group_by(disease_2) %>%
tally()
# A tibble: 3 x 2
disease_2 n
<chr> <int>
1 J209 6
2 J459 11
3 L2088 7
The most important takeaway from all this was that ordinary citizens do not have much right to own or use their own data. If people were able to easily access and browse their own data it would help them to be healthier and protect them from overprescription of medications.
Second, this exercise helped me to see data from different perspectives. An idle curiosity moved me to learn R programming language to wrangling the data. It took more than a year to become versed in R. This personal project motivated me to learn R, and I have become more savvy at data analysis and data visualization because of it.
시각화 챠트를 만드는 과정에서 데이터를 보는 여러 관점으로 상황을 파악할 수 있었다.
일자별 처방 데이터 추가 입력시 자동 통계 프로그램