How Old One can be

Business Problem

I have been presented with this patient related data and some diagnosis information. The ask really is to verify that the birth year information is correct. If not can it be treated or fixed.

Note: This is an anonymized patient data and in no way to can it relate to any real individual.

Data Input
setwd("C:/Users/priyanka.gagneja/Downloads/data")

diagnosis<-read.csv("diagnosis.csv")
patient<-read.csv("patient.csv")

class(patient); class(diagnosis); 
## [1] "data.frame"
## [1] "data.frame"
str(patient); str(diagnosis); 
## 'data.frame':    47181 obs. of  3 variables:
##  $ patient_id: int  2 3 4 5 6 7 8 9 10 11 ...
##  $ birth_year: int  1938 1999 1959 1970 1964 1943 1984 1959 1952 1981 ...
##  $ gender    : Factor w/ 2 levels "f","m": 2 1 2 1 2 1 1 1 2 2 ...
## 'data.frame':    874916 obs. of  4 variables:
##  $ patient_id : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ code_system: Factor w/ 2 levels "UMLS:ICD10CM",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ code       : Factor w/ 100842 levels "003.20","003.21",..: 24587 78904 60300 36073 10668 15330 26061 20630 93769 8402 ...
##  $ year       : int  2012 2012 2012 2013 2013 2013 2013 2013 2013 2013 ...
dim(patient); dim(diagnosis); 
## [1] 47181     3
## [1] 874916      4

These data files are simple versions of patient and diagnosis information.

  • Patient file has demographic information of the patient. gender and birth_year along with a patient_id corresponding to each patient.
  • Diagnosis file on the otherhand has details of visits and diagnoses of each patient.

Back to top

Data Exploration
summary(patient)
##    patient_id      birth_year   gender   
##  Min.   :    2   Min.   :1900   f:25794  
##  1st Qu.:12522   1st Qu.:1952   m:21387  
##  Median :25035   Median :1974            
##  Mean   :25020   Mean   :1973            
##  3rd Qu.:37522   3rd Qu.:1995            
##  Max.   :50001   Max.   :2118
hist(patient$birth_year)

#aggregate(patient_id ~ birth_year, patient, length)

nrow(patient[patient$birth_year>2018,])
## [1] 123
treat_p<-patient[patient$birth_year>2018,]
dim(treat_p)
## [1] 123   3

Keeping the focus on birth_year I see that there is quite a bit of variation in this variable and also has some unreasonable values ( birth_year after current year(2018)).

There are 214 cases which indicate patients born in 1900 or 118 yrs old. This seems questionable but maye not impossible. We would keep these cases for now and observe its diagnosis codes (count , first visits information etc).

summary(diagnosis)
##    patient_id          code_system           code             year     
##  Min.   :    2   UMLS:ICD10CM:571683   V95.41XS:  7466   Min.   :1900  
##  1st Qu.:12490   UMLS:ICD9CM :303233   V95.41XA:  7464   1st Qu.:2009  
##  Median :25005                         V95.41XD:  7455   Median :2010  
##  Mean   :25000                         T71.234S:   164   Mean   :2010  
##  3rd Qu.:37505                         T71.232A:   163   3rd Qu.:2013  
##  Max.   :50001                         T71.231 :   162   Max.   :2118  
##                                        (Other) :852042
hist(diagnosis$year)

#aggregate(patient_id ~ year, diagnosis, length)

nrow(diagnosis[diagnosis$year>2018,])
## [1] 3846
treat_d<-diagnosis[diagnosis$year>2018,]
dim(treat_d)
## [1] 3846    4
length(unique(treat_d$patient_id))
## [1] 3702

Below is a sample of only 25 of the 396 ICD10CM diagnostic codes related to newborns. I will work with this sample list for simplicity (I did not find a downloadable version of ICD10 codes) and to design the code.

code<-c("P92.4","L08.82","P92.3","P38","P02.7","P28.19","P92.09","P92.1","P92.9","N47.0","P01.0","P01.3","P01.4","P19","P29.30","P36.9","P53","P80.8","P80.9","P83.30","P92.2","P01.5","P22","P83.4","P03.5")
code_system<-"UMLS:ICD10CM"
codes<-data.frame(cbind(code,code_system))

For cases where diagnosis codes match with newborn/infant and have unreasonable birth_year, I will replace birth_year with least value of year (indicating first visit of the patient).

# Keeping patient file as the base
p_d<-merge(patient,diagnosis, by = "patient_id", all.x = TRUE, all.y = FALSE)
p_d_nb<-merge(p_d,codes, by = c("code_system","code"), all.x = FALSE, all.y = FALSE)

# x<-merge(patient,diagnosis, by = "patient_id", all.x = FALSE, all.y = TRUE)
# length(unique(x$patient_id))     # 49713 ==> 49713-46894 = 2819 patients indiagnosis do not have birth/gender info.

summary(p_d)
##    patient_id      birth_year   gender           code_system    
##  Min.   :    2   Min.   :1900   f:451291   UMLS:ICD10CM:539260  
##  1st Qu.:12511   1st Qu.:1952   m:374283   UMLS:ICD9CM :286027  
##  Median :25037   Median :1974              NA's        :   287  
##  Mean   :25018   Mean   :1973                                   
##  3rd Qu.:37526   3rd Qu.:1995                                   
##  Max.   :50001   Max.   :2118                                   
##                                                                 
##        code             year     
##  V95.41XS:  7058   Min.   :1900  
##  V95.41XA:  7034   1st Qu.:2009  
##  V95.41XD:  7012   Median :2010  
##  T71.232D:   156   Mean   :2010  
##  T71.234S:   156   3rd Qu.:2013  
##  (Other) :803871   Max.   :2118  
##  NA's    :   287   NA's   :287

Unfortunately, there are no such cases in this data. I see here that there are also cases where patients birth_year is recorded as after year (patients first visit. Wait!! Really?? A medical visit before being born :-| ).

library(reshape2)
pd_firstvisit<-aggregate( year ~ patient_id , p_d, min)   
# 287 patient info without diagnostic info got lost in the process

pd_firstvisit2<-merge(pd_firstvisit,patient,by = "patient_id" ,all.x=TRUE)


# First visit before birth (birth_year) 
bad_data<-pd_firstvisit2[pd_firstvisit2$year < pd_firstvisit2$birth_year,]  
length(unique(bad_data$patient_id))        #6822
## [1] 6822
nrow(bad_data[bad_data$birth_year>2018,])  #121
## [1] 121

121 patient_id out of 123 (identified above) correspond to data incorrectly recorded for birth_year and year both. It is better to exclude these data points unless for the presence of more information (age e.g. would be easiest to look at to verify and fix the birth_year).

# First visit after birth (birth_year) 
keep_p<-pd_firstvisit2[pd_firstvisit2$year >=  pd_firstvisit2$birth_year,] 
hist(keep_p$birth_year)

hist(keep_p$year)

Back to top

Interpretation

121 patient_IDs with bad birth_year as well as year of diagnosis need to be removed.

Close to 14% (6822 records) of the patients out of ~47K have been discarded/ excluded in the given situation of limited data (no age information or ICD10 codes for NewBorns etc. that may have helped fix some of the above).

Summary of data

Description No. of unique patient_ids
# patients in patient file 47181
# patients in diagnosis file 49713
# patients in both patient & diagnosis file 46894
# patients with no diagnosis information 287
# patients with diagnosis info but no patient info 2819

Back to top