Rpubs Link: http://rpubs.com/umais/data607-FinalProject
Github Link: https://github.com/umais/DATA-607/tree/master/FinalProject
In this Project I will be analyzing a subset of dummy patients and associated pharmacy data retrieved from files in to a MYSQL Data store . All of this data is de identified data so nothing in this data identifies real patient data. One of my data sources will be csv file that I will be retrieving from a third party software called John Hopkins ACG System.
The purpose of the project is to use the Pharmacy data to determine the Health Risk of Patients and also determine what conditions the patients have based on the drugs being taken. Any company that deals with patients whether it is doctors, health insurance companies or Care Organizations can use this type of analysis to detemine which patients to focus their care coordination efforts on.
I have set the eval value of the below r code to false as the below script is executed from a r Script file which is also a part of this final project. What this code is doing is a one time dump in to My Sql Tables Patient and Pharmacy from the CSV files which can also be found in the inputfiles directory.
library(RMySQL)
library(lubridate)
killDbConnections <- function () {
all_cons <- dbListConnections(MySQL())
print(all_cons)
for(con in all_cons)
+ dbDisconnect(con)
print(paste(length(all_cons), " connections killed."))
}
killDbConnections()
mydb = dbConnect(MySQL(), user='root', password='Welcome@1', dbname='finalproject', host='localhost')
Patient <- read.csv("inputFiles/patients.csv", header = TRUE,sep=",")
Patient$DOB<-as.Date(parse_date_time(Patient$DOB,"mdy"))
head(Patient)
dbWriteTable(mydb, "patient", Patient, overwrite=FALSE,append=TRUE,row.names=FALSE)
Pharmacy <- read.csv("inputFiles/pharmacy.csv", header = TRUE,sep=",")
Pharmacy$FillDate<-as.Date(parse_date_time(Pharmacy$FillDate,"mdy"))
head(Pharmacy)
dbWriteTable(mydb, "prescriptionfillhistory", Pharmacy, overwrite=FALSE,append=TRUE,row.names=FALSE)
In the Code below we will verify that the data loaded sucessfully and we can view the records in the tables.
mydb = dbConnect(MySQL(), user='root', password='Welcome@1', dbname='finalproject', host='localhost')
rs = dbSendQuery(mydb, "SELECT * FROM patient;")
PatientData=fetch(rs, n=-1)
head(PatientData)
## BeneficiaryNumber FirstName LastName DOB
## 1 99350001797 PHIL JERASA 1948-05-05
## 2 99350002197 CENTRELLA DINANT 1951-01-04
## 3 99350002397 J MERRY 1942-06-15
## 4 99350005697 DONITA FERREY JR 1939-09-29
## 5 99350005897 JADA CHARNICK 1926-05-11
## 6 99350006297 AVRUM HUMMON 1942-09-18
rs = dbSendQuery(mydb, "SELECT * FROM PrescriptionFillHistory;")
PrescriptionHistory=fetch(rs, n=-1)
head(PrescriptionHistory)
## BeneficiaryNumber NDC NPI FillDate Supply
## 1 99350001797 955100410 8951265221 2016-06-18 4
## 2 99350001797 16729018317 2448858131 2016-08-13 30
## 3 99350001797 65862056090 4131612881 2016-06-30 30
## 4 99350001797 93965201 2046894101 2016-09-11 5
## 5 99350001797 93220305 8951265221 2016-05-26 1
## 6 99350001797 406012301 8951265221 2016-06-14 3
John Hopkins ACG takes two files as input the patient File and the Pharmacy File. The Pharmacy file is optional but it is required for the purposes of this project. John Hopkins ACG Software processes these two files and then produces two output files one file is the ACG output file and the other one is ACG Conditions File. We will be now analyzing the output that was produced by ACG
PatientACG <- read.csv("inputFiles/ACGOutput.acgd.csv", header = TRUE,sep=",")
head(PatientACG$rescaled_pharmacy_cost_predicted_risk)
## [1] 0.6047990 1.7241220 0.6378245 0.5098452 0.4584118 3.0624195
In the data returned by John Hopkins ACG Software I will be looking at the column rescaled_pharmacy_cost_predicted_risk. I will calculate the mean and Standard Deviation for this column so that I can use that to place the patients in different categories whether they are very low risk, low risk, medium risk , High Risk or Very High Risk.
What I will do here is use my own algorithm to calculate the Risk Level for a patient. As Per John Hopkins ACG software the Mean of the Rescaled Pharmacy cost is 1. So if the value is 1 in this column that means this is how much an average patient will cost based on the drugs they are taking.
The algorithm is below
if a patient cost is 0.5 then I am categorizing them as very low (1).
If the patient is greater than 0.5 but less than equal to the mean 1 then they are low risk (2).
If the patient cost is the mean + standard deviation then they are Medium Risk (3)
If the patient cost index is mean plus standard deviation multiply by 3 then they are High Risk
Finally if they are higher than that then they are very risky.
CalculateRiskLevel <- function(predictedCost,mean,sdvcost)
{
if(predictedCost<0.5)
return(1)
if(predictedCost<=1)
return(2)
if(predictedCost<=mean+sdvcost)
return(3)
if(predictedCost<=(mean+(sdvcost*3)))
return(4)
else
return(5)
}
Mean<-mean(PatientACG$rescaled_pharmacy_cost_predicted_risk)
SDVCost<-sd(PatientACG$rescaled_pharmacy_cost_predicted_risk)
Mean
## [1] 0.9110616
SDVCost
## [1] 0.9699812
PatientACG$RiskLevel= sapply(PatientACG$rescaled_pharmacy_cost_predicted_risk, CalculateRiskLevel,Mean,SDVCost)
head(PatientACG$RiskLevel)
## [1] 2 3 2 2 1 4
hist(PatientACG$RiskLevel, col="darkgreen")
xyplot(RiskLevel ~ age, data = PatientACG,
xlab = "Age",
ylab = "Risk Level",
main = "Relationship between age and Risk Level")
dotplot(RiskLevel ~ age, data = PatientACG, groups = RiskLevel, type = "o",
auto.key = list(space = "right", points = TRUE, lines = TRUE))
In the below code we will be creating counts of the population categorizing them in different buckets from Risk Level very low to Very High
colours <- c("grey", "green", "yellow", "orange", "red")
VeryLow = nrow(subset(PatientACG, RiskLevel==1))
Low = nrow(subset(PatientACG, RiskLevel==2))
Medium = nrow(subset(PatientACG, RiskLevel==3))
High = nrow(subset(PatientACG, RiskLevel==4))
VeryHigh = nrow(subset(PatientACG, RiskLevel==5))
t=data.frame(VeryLow,Low,Medium,High,VeryHigh)
t
## VeryLow Low Medium High VeryHigh
## 1 394 316 191 78 21
barplot(as.matrix(t), main="My Barchart", ylab = "Population", beside=TRUE, col=colours)
We can see based on the above results that there are very few patients that are in the Very High category. This will narrow down the population that care coordination organizations have to focus on.
PatientConditions <- read.csv("inputFiles/ACG_Conditions.csv", header = TRUE,sep=",")
head(PatientConditions)
## patient_id rxmg_code rxmg_description
## 1 9.935e+10 CARx030 Cardiovascular / High Blood Pressure
## 2 9.935e+10 CARx050 Cardiovascular / Vascular Disorders
## 3 9.935e+10 GASx060 Gastrointestinal/Hepatic / Peptic Disease
## 4 9.935e+10 GSIx010 General Signs and Symptoms / Nausea and Vomiting
## 5 9.935e+10 GSIx020 General Signs and Symptoms / Pain
## 6 9.935e+10 INFx020 Infections / Acute Minor
## major_rxmg_code major_rxmg_description rxmg_impact_type
## 1 CAR Cardiovascular L
## 2 CAR Cardiovascular H
## 3 GAS Gastrointestinal/Hepatic M
## 4 GSI General Signs and Symptoms M
## 5 GSI General Signs and Symptoms M
## 6 INF Infections L
## rxmg_medical_source rxmg_pharmacy_source
## 1 N Y
## 2 N Y
## 3 N Y
## 4 N Y
## 5 N Y
## 6 N Y
As we can see that John Hopkins ACG software also produced result that can help tell us what conditions a patient has. Out of these conditions I would like to analyze the data for some conditions such as High Blood Pressure and Diabetes without Insulin.
The RxMGCode is a proprietry code from John Hopkins that is used to identify a co-morbidity (condition). In order to analyze the conditions I mentioned above I will be using the RxMG Code associated with those two conditions.
The two RxMG Codes of interest are CARx030 and ENDx040
HighBloodPressure = subset(PatientConditions, rxmg_code=="CARx030")
DiabetesWithoutInsulin= subset(PatientConditions, rxmg_code=="ENDx040")
head(HighBloodPressure)
## patient_id rxmg_code rxmg_description
## 1 99350001797 CARx030 Cardiovascular / High Blood Pressure
## 16 99350002397 CARx030 Cardiovascular / High Blood Pressure
## 19 99350002497 CARx030 Cardiovascular / High Blood Pressure
## 23 99350004197 CARx030 Cardiovascular / High Blood Pressure
## 28 99350004697 CARx030 Cardiovascular / High Blood Pressure
## 40 99350005697 CARx030 Cardiovascular / High Blood Pressure
## major_rxmg_code major_rxmg_description rxmg_impact_type
## 1 CAR Cardiovascular L
## 16 CAR Cardiovascular L
## 19 CAR Cardiovascular L
## 23 CAR Cardiovascular L
## 28 CAR Cardiovascular L
## 40 CAR Cardiovascular L
## rxmg_medical_source rxmg_pharmacy_source
## 1 N Y
## 16 N Y
## 19 N Y
## 23 N Y
## 28 N Y
## 40 N Y
head(DiabetesWithoutInsulin)
## patient_id rxmg_code rxmg_description
## 9 99350002197 ENDx040 Endocrine / Diabetes Without Insulin
## 18 99350002397 ENDx040 Endocrine / Diabetes Without Insulin
## 53 99350006497 ENDx040 Endocrine / Diabetes Without Insulin
## 58 99350007097 ENDx040 Endocrine / Diabetes Without Insulin
## 66 99350007797 ENDx040 Endocrine / Diabetes Without Insulin
## 71 99350008397 ENDx040 Endocrine / Diabetes Without Insulin
## major_rxmg_code major_rxmg_description rxmg_impact_type
## 9 END Endocrine L
## 18 END Endocrine L
## 53 END Endocrine L
## 58 END Endocrine L
## 66 END Endocrine L
## 71 END Endocrine L
## rxmg_medical_source rxmg_pharmacy_source
## 9 N Y
## 18 N Y
## 53 N Y
## 58 N Y
## 66 N Y
## 71 N Y
colours <- c("red","yellow")
totalHighBloodPressure=nrow(HighBloodPressure)
totalDiabetesWithoutInsulin=nrow(DiabetesWithoutInsulin)
data=data.frame(totalHighBloodPressure,totalDiabetesWithoutInsulin)
barplot(as.matrix(data), main="Barchart Showing High Blood Pressure and DIabetes Patients", ylab = "Population", beside=TRUE, col=colours)
I want to do some further analysis based on the gender of the patient the risk level
mosaicplot(table(PatientACG$sex,PatientACG$RiskLevel))
We can see the distribution of the Female and male in different Risk levels
One more thing I want to check is what percentage of female out of the population are at high risk versus the male.
TotalPatients=nrow(PatientACG)
totalFemaleHighRisk=nrow(subset(PatientACG,sex=="F" & RiskLevel==5))
totalMaleHighRisk=nrow(subset(PatientACG,sex=="M" & RiskLevel==5))
totalFemaleHighRisk
## [1] 13
totalMaleHighRisk
## [1] 8
print(paste("Percentage of Male with High Risk" , (totalMaleHighRisk/TotalPatients)*100.0))
## [1] "Percentage of Male with High Risk 0.8"
print(paste("Percentage of Female with High Risk" , (totalFemaleHighRisk/TotalPatients)*100))
## [1] "Percentage of Female with High Risk 1.3"
We can see based on the results that we can use the pharmacy data to determine the Risk Level of the Patients as well as see what Implied conditions they may have based on the drugs being taken. This is a very helpful way to get an insight in to what the patients condition is.
We can take this a step further and try to determine the gaps in prescription fills of the patients and see if they are being adherent in filling there prescriptions. We will not be doing that in this project but I think it is something worth looking in to for future Projects.