Business Problem
I have been presented with this patient related data and some diagnosis information. The ask really is to verify that the birth year information is correct. If not can it be treated or fixed.
Note: This is an anonymized patient data and in no way to can it relate to any real individual.
setwd("C:/Users/priyanka.gagneja/Downloads/data")
diagnosis<-read.csv("diagnosis.csv")
patient<-read.csv("patient.csv")
class(patient); class(diagnosis);
## [1] "data.frame"
## [1] "data.frame"
str(patient); str(diagnosis);
## 'data.frame': 47181 obs. of 3 variables:
## $ patient_id: int 2 3 4 5 6 7 8 9 10 11 ...
## $ birth_year: int 1938 1999 1959 1970 1964 1943 1984 1959 1952 1981 ...
## $ gender : Factor w/ 2 levels "f","m": 2 1 2 1 2 1 1 1 2 2 ...
## 'data.frame': 874916 obs. of 4 variables:
## $ patient_id : int 2 2 2 2 2 2 2 2 2 2 ...
## $ code_system: Factor w/ 2 levels "UMLS:ICD10CM",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ code : Factor w/ 100842 levels "003.20","003.21",..: 24587 78904 60300 36073 10668 15330 26061 20630 93769 8402 ...
## $ year : int 2012 2012 2012 2013 2013 2013 2013 2013 2013 2013 ...
dim(patient); dim(diagnosis);
## [1] 47181 3
## [1] 874916 4
These data files are simple versions of patient and diagnosis information.
Patient file has demographic information of the patient. gender and birth_year along with a patient_id corresponding to each patient.Diagnosis file on the otherhand has details of visits and diagnoses of each patient.summary(patient)
## patient_id birth_year gender
## Min. : 2 Min. :1900 f:25794
## 1st Qu.:12522 1st Qu.:1952 m:21387
## Median :25035 Median :1974
## Mean :25020 Mean :1973
## 3rd Qu.:37522 3rd Qu.:1995
## Max. :50001 Max. :2118
hist(patient$birth_year)
#aggregate(patient_id ~ birth_year, patient, length)
nrow(patient[patient$birth_year>2018,])
## [1] 123
treat_p<-patient[patient$birth_year>2018,]
dim(treat_p)
## [1] 123 3
Keeping the focus on birth_year I see that there is quite a bit of variation in this variable and also has some unreasonable values ( birth_year after current year(2018)).
There are 214 cases which indicate patients born in 1900 or 118 yrs old. This seems questionable but maye not impossible. We would keep these cases for now and observe its diagnosis codes (count , first visits information etc).
summary(diagnosis)
## patient_id code_system code year
## Min. : 2 UMLS:ICD10CM:571683 V95.41XS: 7466 Min. :1900
## 1st Qu.:12490 UMLS:ICD9CM :303233 V95.41XA: 7464 1st Qu.:2009
## Median :25005 V95.41XD: 7455 Median :2010
## Mean :25000 T71.234S: 164 Mean :2010
## 3rd Qu.:37505 T71.232A: 163 3rd Qu.:2013
## Max. :50001 T71.231 : 162 Max. :2118
## (Other) :852042
hist(diagnosis$year)
#aggregate(patient_id ~ year, diagnosis, length)
nrow(diagnosis[diagnosis$year>2018,])
## [1] 3846
treat_d<-diagnosis[diagnosis$year>2018,]
dim(treat_d)
## [1] 3846 4
length(unique(treat_d$patient_id))
## [1] 3702
Below is a sample of only 25 of the 396 ICD10CM diagnostic codes related to newborns. I will work with this sample list for simplicity (I did not find a downloadable version of ICD10 codes) and to design the code.
code<-c("P92.4","L08.82","P92.3","P38","P02.7","P28.19","P92.09","P92.1","P92.9","N47.0","P01.0","P01.3","P01.4","P19","P29.30","P36.9","P53","P80.8","P80.9","P83.30","P92.2","P01.5","P22","P83.4","P03.5")
code_system<-"UMLS:ICD10CM"
codes<-data.frame(cbind(code,code_system))
For cases where diagnosis codes match with newborn/infant and have unreasonable birth_year, I will replace birth_year with least value of year (indicating first visit of the patient).
# Keeping patient file as the base
p_d<-merge(patient,diagnosis, by = "patient_id", all.x = TRUE, all.y = FALSE)
p_d_nb<-merge(p_d,codes, by = c("code_system","code"), all.x = FALSE, all.y = FALSE)
# x<-merge(patient,diagnosis, by = "patient_id", all.x = FALSE, all.y = TRUE)
# length(unique(x$patient_id)) # 49713 ==> 49713-46894 = 2819 patients indiagnosis do not have birth/gender info.
summary(p_d)
## patient_id birth_year gender code_system
## Min. : 2 Min. :1900 f:451291 UMLS:ICD10CM:539260
## 1st Qu.:12511 1st Qu.:1952 m:374283 UMLS:ICD9CM :286027
## Median :25037 Median :1974 NA's : 287
## Mean :25018 Mean :1973
## 3rd Qu.:37526 3rd Qu.:1995
## Max. :50001 Max. :2118
##
## code year
## V95.41XS: 7058 Min. :1900
## V95.41XA: 7034 1st Qu.:2009
## V95.41XD: 7012 Median :2010
## T71.232D: 156 Mean :2010
## T71.234S: 156 3rd Qu.:2013
## (Other) :803871 Max. :2118
## NA's : 287 NA's :287
Unfortunately, there are no such cases in this data. I see here that there are also cases where patients birth_year is recorded as after year (patients first visit. Wait!! Really?? A medical visit before being born :-| ).
library(reshape2)
pd_firstvisit<-aggregate( year ~ patient_id , p_d, min)
# 287 patient info without diagnostic info got lost in the process
pd_firstvisit2<-merge(pd_firstvisit,patient,by = "patient_id" ,all.x=TRUE)
# First visit before birth (birth_year)
bad_data<-pd_firstvisit2[pd_firstvisit2$year < pd_firstvisit2$birth_year,]
length(unique(bad_data$patient_id)) #6822
## [1] 6822
nrow(bad_data[bad_data$birth_year>2018,]) #121
## [1] 121
121 patient_id out of 123 (identified above) correspond to data incorrectly recorded for birth_year and year both. It is better to exclude these data points unless for the presence of more information (age e.g. would be easiest to look at to verify and fix the birth_year).
# First visit after birth (birth_year)
keep_p<-pd_firstvisit2[pd_firstvisit2$year >= pd_firstvisit2$birth_year,]
hist(keep_p$birth_year)
hist(keep_p$year)
121 patient_IDs with bad birth_year as well as year of diagnosis need to be removed.
Close to 14% (6822 records) of the patients out of ~47K have been discarded/ excluded in the given situation of limited data (no age information or ICD10 codes for NewBorns etc. that may have helped fix some of the above).
Summary of data
| Description | No. of unique patient_ids |
|---|---|
| # patients in patient file | 47181 |
| # patients in diagnosis file | 49713 |
| # patients in both patient & diagnosis file | 46894 |
| # patients with no diagnosis information | 287 |
| # patients with diagnosis info but no patient info | 2819 |