Our health care system is characterized by high and rising costs as well as gaps in quality, safety, equity, and access. The length of stay in hospitals is often used as an indicator of the efficiency of care and hospital performance. It is generally recognized that a shorter stay indicates less resource consumption per discharge, and the ability to predict the length of stay as an initial assessment of patients’ risk is critical for better resource planning and allocation. Another hospital performance indicator is hospital readmission, which is a high-priority health care quality measure and a target for cost reduction. Despite broad interest in readmission, relatively little research has focused on patients with diabetes. The burden of diabetes among hospitalized patients is substantial, growing, and costly, and readmissions contribute a significant portion of this burden. Reducing readmission rates and hospital length of stay for diabetic patients therefore has the potential to greatly reduce health care costs while simultaneously improving care.
The first question Group 4 explored is “Can we construct a linear regression model to predict the length of stay in hospital using a combination of the variables in our data set?” The purpose of this question is to predict the length of stay in hospital as an initial assessment of a patient’s risk, so that hospital management teams have greater flexibility in hospital bed use and a better assessment of the cost-effectiveness of treatment. We envision our model being most helpful to bed managers, who could use it to foresee bottlenecks in bed availability when admitting patients and avoid unnecessary bed transfers between wards.
The second question Group 4 investigated is “Can we construct a logistic regression model to predict the risk of getting readmitted using a combination of the variables in our data set?” By exploring the relationship between hospital readmissions and other variables in our data set, we strive to understand the risk factors that lead to hospital readmission within 30 days of discharge. We would like to construct and compare different models to optimize prediction performance so that hospitals can design well-targeted early intervention programs to reduce readmission risk, such as inpatient education, specialty care, better discharge instructions, coordination of care, and post-discharge support.
We found this dataset on Kaggle, where a user named “Humberto Brandão” uploaded it four years ago. The dataset was created by researchers at Virginia Commonwealth University by pulling data from the Health Facts database, a national data warehouse that collects comprehensive clinical records across 130 hospitals in the United States from 1999 to 2008. The same group of researchers used the dataset to explore the impact of HbA1c measurement on hospital readmission rates, but many variables were left untouched in their analysis, and we chose to investigate several of these. After removing NAs and missing values, we constructed our final cleaned data set, which contains 19 features in total (including patient number) along with 57,222 observations.
We used almost every feature in the dataset except for patient number (Patient_Nbr) and total hospital visits (Number_Of_Visits_Total). Patient number is just a unique identifier for each patient, so the number itself does not carry any information. Total hospital visits measures the total number of times a patient visited the hospital from 1999 to 2008; since we only want to predict the most recent and immediate readmission risk, we focused on the binary variable “Readmitted” and dropped total hospital visits from our final model. After these modifications, the final clean dataset that we used for further analysis contains 17 variables and 57,222 observations. The table below is a preview of our final dataset.
| Race | Gender | Age | Admission_Type | Discharge_Disposition | Admission_Source | Time_In_Hospital | Num_Medications | Number_Diagnoses | Num_Procedures_Total | Num_Visit_Previous_Year | Primary_Diag | Max_Glu_Serum | A1Cresult | Med_Change | Diabetes_Med_Prescribed | Readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Caucasian | Female | [10-20) | Emergency | DischargeToHome | Other | 3 | 18 | 9 | 59 | 0 | Neoplasms | None | None | Yes | Yes | No |
| AfricanAmerican | Female | [20-30) | Emergency | DischargeToHome | Other | 2 | 13 | 6 | 16 | 3 | Other | None | None | No | Yes | No |
| Caucasian | Male | [30-40) | Emergency | DischargeToHome | Other | 2 | 16 | 7 | 45 | 0 | Other | None | None | Yes | Yes | No |
| Caucasian | Male | [40-50) | Emergency | DischargeToHome | Other | 1 | 8 | 5 | 51 | 0 | Neoplasms | None | None | Yes | Yes | No |
| Caucasian | Male | [50-60) | Urgent | DischargeToHome | Referral | 3 | 16 | 9 | 37 | 0 | Circulatory | None | None | No | Yes | No |
| Caucasian | Male | [60-70) | Elective | DischargeToHome | Referral | 4 | 21 | 7 | 71 | 0 | Circulatory | None | None | Yes | Yes | No |
| Caucasian | Male | [70-80) | Emergency | DischargeToHome | Other | 5 | 12 | 8 | 73 | 0 | Circulatory | None | None | No | Yes | No |
| Caucasian | Female | [80-90) | Urgent | DischargeToHome | Transfer | 13 | 28 | 8 | 70 | 0 | Circulatory | None | None | Yes | Yes | No |
| Caucasian | Female | [90-100) | Elective | DischargeToFacilityWithCare | Transfer | 12 | 18 | 8 | 36 | 0 | Circulatory | None | None | Yes | Yes | No |
| AfricanAmerican | Female | [40-50) | Emergency | DischargeToHome | Other | 9 | 17 | 9 | 49 | 0 | Diabetes | None | None | No | Yes | No |
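For reference, the cleaning step described above can be reproduced with a short script like the following. This is only a sketch: the filename is hypothetical, and the assumption that missing values are coded as “?” (as in other releases of this data) may not hold for this particular Kaggle export.

```python
import pandas as pd

# Hypothetical filename for the raw Kaggle export; adjust to the actual file.
raw = pd.read_csv("diabetes_data.csv")

clean = (
    raw.replace("?", pd.NA)  # assumption: some versions code missing values as "?"
       .dropna()             # drop rows with NAs / missing values
       # Patient_Nbr is only an identifier; Number_Of_Visits_Total spans 1999-2008,
       # while we only predict the most recent readmission.
       .drop(columns=["Patient_Nbr", "Number_Of_Visits_Total"])
)

print(clean.shape)  # expected to be about (57222, 17) after the cleaning described above
```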
Since we are predicting the length of stay in hospital (Time_In_Hospital) and whether a patient is readmitted (Readmitted), we created the following two frequency plots to visualize the distribution of the data.
To provide more information on what each variable stands for in a real-world context, we created the following table that contains the name of the variable, the variable type, a short description of the variable, and a summary of the values that it can take.
| Feature.Name | Type | Description | Values |
|---|---|---|---|
| Race | Categorical | The race of the patient | Caucasian, Asian, African American, Hispanic, and Other |
| Gender | Categorical | The biological sex of the patient | Male and Female |
| Age | Categorical | The age of the patient | [0, 10), [10, 20), . . ., [90, 100) |
| Admission_Type | Categorical | Admission Type is used to classify how the patient came to the hospital | Emergency, Trauma Center, Urgent, Elective, and Newborn |
| Discharge_Disposition | Categorical | The patient’s anticipated location or status following the encounter | DischargeToHome, DischargeToFacilityWithCare, and Other |
| Admission_Source | Categorical | The source of admission to a hospital | Transfer, Referral, Birth, and Other |
| Primary_Diag | Categorical | The main condition treated or investigated during the relevant episode of healthcare | Circulatory, Respiratory, Digestive, Diabetes, Injury, Musculoskeletal, Genitourinary, Neoplasms, and Other |
| Max_Glu_Serum | Categorical | Indicates the range of the blood glucose level at the time of hospital admission | “>200” (between 200 and 300), “>300” (greater than 300), “Normal” (less than 140), and “None” (not measured) |
| A1Cresult | Categorical | The measurement of Hemoglobin A1c (an important measure of glucose control) at the time of hospital admission | “>8” (greater than 8%), “>7” (between 7% and 8%), “Normal” (under 7%), and “None” (not measured) |
| Med_Change | Categorical | Indicates if there was a change in diabetic medications (either dosage or brand) | Yes or No |
| Diabetes_Med_Prescribed | Categorical | Indicates if there was any diabetic medication prescribed | Yes or No |
| Readmitted | Categorical | Indicates if the patient was readmitted within 30 days of discharge | Yes or No |
| Time_In_Hospital | Numeric | Integer number of days between admission and discharge | Integer |
| Num_Medications | Numeric | Number of distinct generic names administered during the encounter | Integer |
| Number_Diagnoses | Numeric | Number of diagnoses entered into the system | Integer |
| Num_Procedures_Total | Numeric | Number of total procedures (lab tests and other) performed during the encounter | Integer |
| Num_Visit_Previous_Year | Numeric | Number of total visits (outpatient, emergency, and inpatient) of the patient in the year preceding the encounter | Integer |
In our initial analysis, we found an interesting relationship between Time_In_Hospital (the length of stay in hospital) and Readmitted (whether the patient is readmitted). From the relative frequency plot below, it is clear that patients who did not get readmitted tend to spend less time in the hospital than patients who did. Specifically, 64% of patients who did not get readmitted spent four days or fewer in the hospital, whereas only 56% of readmitted patients did. The side-by-side relative frequency plot demonstrates this relationship between the two variables of interest.
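This comparison can be reproduced with a short computation such as the one below, a sketch that assumes the cleaned data frame from the earlier sketch is named `clean`.

```python
# Share of patients staying 4 days or fewer, split by readmission status.
short_stay = (
    clean.assign(StayLE4=clean["Time_In_Hospital"] <= 4)
         .groupby("Readmitted")["StayLE4"]
         .mean()
)
print(short_stay)  # roughly 0.64 for "No" and 0.56 for "Yes", per the frequencies reported above
```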
In order to answer the first question, “How can we most accurately predict the length of stay in the hospital for a patient?”, we first had to exclude certain variables that were not a good fit for the model. Several variables were excluded because of complications common to medical data; in particular, some variables were so broad that, if added, they produced extreme outliers and did not help predict the length of stay in a hospital. Secondly, we split our data through random sampling into a training set containing 80% of the observations and a test set containing the remaining 20%. Lastly, with the remaining variables, we constructed the linear model that best predicted the length of stay in a hospital.
|  | med_linmod | proc_linmod | diag_linmod |
|---|---|---|---|
| MAE | 1.970802 | 2.115576 | 2.205688 |
The table above shows the three best (lowest) mean absolute errors (MAE) that we calculated after fitting models across our candidate variables.
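A minimal sketch of how this comparison could be run with scikit-learn is shown below; it reuses the cleaned data frame `clean` from earlier, and the exact predictor set behind each of med_linmod, proc_linmod, and diag_linmod is an assumption made for illustration.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# 80/20 random split, as described above.
train, test = train_test_split(clean, test_size=0.2, random_state=42)

# Candidate predictor sets (assumed for illustration; names mirror the table above).
candidates = {
    "med_linmod": ["Num_Medications"],
    "proc_linmod": ["Num_Procedures_Total"],
    "diag_linmod": ["Number_Diagnoses"],
}

for name, cols in candidates.items():
    model = LinearRegression().fit(train[cols], train["Time_In_Hospital"])
    preds = model.predict(test[cols])
    print(name, "MAE:", mean_absolute_error(test["Time_In_Hospital"], preds))
```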
As the linear regression diagnostic plots above show, linearity seems to hold well: the red line sits close to the black dashed line in the Residuals vs Fitted plot. From the Normal Q-Q plot we can tell that the upper quantiles contain more frequent outliers than the lower quantiles, which appear closer to normally distributed; this suggests the residual distribution is right-skewed. Finally, the Residuals vs Leverage plot shows that all leverage values are below 0.5, so no single observation has enough influence to severely skew the fit. From these plots, we feel confident that our linear model will do a good job of predicting the length of stay for patients in the hospital.
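The diagnostic plots referenced above can be generated along the following lines; this is a sketch using statsmodels and matplotlib that assumes the train split and predictor columns from the previous sketch.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

predictors = ["Num_Medications", "Num_Procedures_Total", "Number_Diagnoses"]
X = sm.add_constant(train[predictors])
ols_fit = sm.OLS(train["Time_In_Hospital"], X).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs Fitted: checks the linearity assumption discussed above.
axes[0].scatter(ols_fit.fittedvalues, ols_fit.resid, s=5, alpha=0.3)
axes[0].axhline(0, color="red", linestyle="--")
axes[0].set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs Fitted")

# Normal Q-Q: checks how close the residuals are to a normal distribution in the tails.
sm.qqplot(ols_fit.resid, line="45", fit=True, ax=axes[1])
axes[1].set_title("Normal Q-Q")

plt.tight_layout()
plt.show()
```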
\[ \widehat{\text{Time\_In\_Hospital}} = -0.4567879 + \beta_M M + \beta_P P + \beta_D D \]
where
M = Number of Medications
P = Number of Total Procedures
D = Number of Diagnoses
The equation above is the numerical representation of our model for predicting the time in hospital for a given patient: -0.4567879 is the intercept of our linear model, and \(\beta_M\), \(\beta_P\), and \(\beta_D\) are the fitted slopes for the three predictor variables that we chose.
The results of this linear regression are quite interesting: the three variables that were most influential in estimating the time spent in the hospital are the total number of procedures, the number of medications, and the number of diagnoses. In the first three graphs, it is clear that each of these variables has a weak positive correlation with time spent in the hospital. In our final visual, we included all three variables along with a prediction line: time spent in the hospital is on the y-axis, the total number of procedures is on the x-axis, the number of medications is represented by color (dark blue for low values, light blue for high values), and the number of diagnoses is represented by point size. Our data set contains so many observations that, even with a seemingly weak correlation between time in the hospital and these three variables, the prediction line has a small margin of error. Although the line is not always spot on, this small margin of error shows that it is the best representation for predicting time in the hospital, given the data.
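A sketch of how the combined visual described above could be drawn with matplotlib is shown below; the encodings follow the description (x = total procedures, y = time in hospital, color = number of medications, point size = number of diagnoses), and the trend line here is a simple one-variable stand-in for the model’s prediction line.

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(8, 5))
scatter = ax.scatter(
    clean["Num_Procedures_Total"],
    clean["Time_In_Hospital"],
    c=clean["Num_Medications"],            # color encodes number of medications
    s=5 + 3 * clean["Number_Diagnoses"],   # size encodes number of diagnoses
    cmap="Blues_r",                        # dark blue = low, light blue = high, as described above
    alpha=0.3,
)
fig.colorbar(scatter, label="Num_Medications")

# Simple trend line of time in hospital against total procedures (illustrative only).
coefs = np.polyfit(clean["Num_Procedures_Total"], clean["Time_In_Hospital"], 1)
xs = np.linspace(clean["Num_Procedures_Total"].min(), clean["Num_Procedures_Total"].max(), 100)
ax.plot(xs, np.polyval(coefs, xs), color="red", label="Prediction line")
ax.set(xlabel="Num_Procedures_Total", ylabel="Time_In_Hospital")
ax.legend()
plt.show()
```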
For our second question, we attempt to fit a model to predict whether a patient is readmitted by a hospital or not. Before model construction, we balanced our data set in three ways: adding class weights, random under-sampling, and random over-sampling. As the pie chart shows, about 91% of the patients in our sample were not readmitted and only 8.6% were readmitted, so our data set is imbalanced. Because the not-readmitted class has a much larger sample size than the readmitted class, our first attempt at a model achieved high accuracy by predicting the not-readmitted class but failed to capture any of the readmitted class. Therefore, we decided to balance our dataset first and then construct models.
First, we manually assigned different weights to the majority class (not readmitted) and the minority class (readmitted) using the formula behind `class_weight = "balanced"`, which is `n_samples / (n_classes * np.bincount(y))`. The weight assigned to the not-readmitted class is about 0.547, and the weight assigned to the readmitted class is about 5.793. Second, we used random under-sampling, which randomly removes observations from the not-readmitted class until it has the same number of observations as the readmitted class. Third, we used random over-sampling, which randomly replicates observations from the readmitted class until it has the same number of observations as the not-readmitted class.
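A minimal sketch of these three balancing approaches is given below, assuming scikit-learn and imbalanced-learn are available and that `X` holds the 16 predictors and `y` the binary Readmitted label (these variable names are ours).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# 1) Class weights via the "balanced" formula n_samples / (n_classes * np.bincount(y)).
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))  # roughly 0.55 for "No" and 5.79 for "Yes", per the text above

# 2) Random under-sampling: drop not-readmitted rows until the classes are the same size.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)

# 3) Random over-sampling: replicate readmitted rows until the classes are the same size.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
```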
After creating three balanced datasets with these methods, we split each of them into a train-test pair. Since we have 16 potential predictors, we used the LASSO (Least Absolute Shrinkage and Selection Operator) regularization method to select the most useful and important variables for predicting the binary response of whether a patient is readmitted or not. LASSO offers a neat way to model the response variable while automatically selecting significant variables by shrinking the coefficients of unimportant predictors to zero, so we do not need to check each variable’s p-value. We then applied LASSO-regularized logistic regression to the three balanced datasets, as sketched below; each model’s accuracy and F1 score, together with the unbalanced baseline, are shown in the table that follows.
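In scikit-learn terms, this LASSO step corresponds to an L1-penalized logistic regression. The sketch below shows the weighted variant (the under- and over-sampled datasets follow the same pattern); the one-hot encoding is an assumption about preprocessing, and the penalty strength is not tuned here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# One-hot encode the categorical predictors (an assumed preprocessing step).
X_enc = pd.get_dummies(X, drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_enc, y, test_size=0.2, random_state=42, stratify=y
)

# L1 penalty gives LASSO-style shrinkage; class_weight="balanced" is the weighted variant.
lasso_logit = LogisticRegression(
    penalty="l1", solver="liblinear", class_weight="balanced", max_iter=1000
)
lasso_logit.fit(X_train, y_train)

preds = lasso_logit.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print("F1:", f1_score(y_test, preds, pos_label="Yes"))
```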
| Balance.Method | Model | Accuracy | F.measure |
|---|---|---|---|
| Unbalanced | Model_Unbalanced | 0.9088590 | 0.0000000 |
| Weight | Model_Weighted | 0.6087662 | 0.2031176 |
| Under-sampling | Model_Under | 0.4359150 | 0.4720975 |
| Over-sampling | Model_Over | 0.4071738 | 0.3909144 |
The F-measure is the harmonic mean of precision and recall, which balances both concerns in a single number: \[ F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \] In our case, to find the model that best fits our imbalanced data and captures the readmitted (minority) class, we should consider not only accuracy but also the F-measure. As the results in the previous table show, the model trained on the dataset balanced by weight assignment is the most reasonable choice: although its accuracy is lower than that of the model trained on the unbalanced data, it improves the F-measure from 0 to about 0.20. The under-sampling model has the highest F-measure, but its accuracy is too low to support any meaningful inference. Therefore, we selected Model_Weighted as our final model. The graph below visualizes the confusion table of our final model, and the table after it lists every predictor and its coefficient.
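For reference, the confusion table for the weighted model can be computed and displayed along these lines; the sketch assumes the fitted `lasso_logit` and the test split from the earlier sketch.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, lasso_logit.predict(X_test), labels=["No", "Yes"])
ConfusionMatrixDisplay(cm, display_labels=["Not readmitted", "Readmitted"]).plot()
plt.show()
```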
| Parameter | Estimate |
|---|---|
| Intercept | -0.1341334 |
| Race | 0.0005205 |
| Gender | -0.0053869 |
| Age | 0.0216646 |
| Admission_Type | 0.0052035 |
| Discharge_Disposition | -0.1369527 |
| Admission_Source | -0.0071363 |
| Time_In_Hospital | 0.0113194 |
| Num_Medications | 0.0018743 |
| Number_Diagnoses | 0.0192903 |
| Num_Procedures_Total | 0.0010003 |
| Num_Visit_Previous_Year | 0.0327473 |
| Primary_Diag | -0.0047095 |
| Max_Glu_Serum | -0.0530045 |
| A1Cresult | 0.0125360 |
| Med_Change | 0.0147118 |
| Diabetes_Med_Prescribed | 0.0530578 |
As our final model’s coefficient table shows, all 16 variables in the data are useful and important. The LASSO method did not rule out any variable, so we conclude that whether a patient is readmitted is influenced by all 16 variables shown in the table above. Among them, Discharge_Disposition has the largest absolute coefficient, meaning that in our model it has the strongest relationship with whether the patient is readmitted.
The goal of our first question was to predict the length of stay in hospital (Time_In_Hospital). Our key finding is that Num_Medications, Num_Procedures_Total, and Number_Diagnoses all have a positive correlation with Time_In_Hospital. We arrived at this conclusion by first constructing a linear regression model for each predictor variable to get a rough idea of its association with Time_In_Hospital. We then weeded out the non-influential variables by calculating the mean absolute error for each combination of variables, and used the adjusted R-squared to narrow the set down further. It is not surprising that the three best predictor variables were Num_Medications, Num_Procedures_Total, and Number_Diagnoses: all three had the highest R-squared values, the lowest mean absolute errors, and significant p-values. It is, however, quite surprising that adding categorical variables to our model did not greatly improve our prediction accuracy.
The goal of our second question was to predict whether a patient will get readmitted (Readmitted). Since only about 10% of the patients were readmitted to the hospital, we first needed to balance our dataset before performing any analysis. We applied three balancing techniques: weight assignment, under-sampling, and over-sampling. After balancing our highly skewed dataset and applying LASSO regression, we compared the four models on the test dataset and identified the best one, which was constructed using weight assignment. It is quite surprising that the LASSO retained every variable in the dataset as significant. Our best model includes all 16 variables and reaches roughly 60% prediction accuracy.
These results are especially relevant in the real world considering that hospital readmission and length of stay are two important health care quality measures and drivers of cost. The risk variables we identified for readmission and for longer length of stay can be beneficial to both hospital administrative staff and physicians. Knowing the three risk variables for longer length of stay, hospital managers can plan capacity better early in a patient’s treatment. Knowing the total number of procedures, the number of medications, and the number of diagnoses, supply chain specialists can foresee bottlenecks in resource availability and hospital capacity and avoid unnecessary resource shortages. The 16 risk factors that contribute to hospital readmission call for a mix of strategies to reduce readmission risk, such as inpatient education, specialty care, better discharge instructions, coordination of care, and post-discharge support. In particular, for patients with diabetes, diabetes-specific strategies such as diabetes education, intensified therapy, and outpatient diabetes care can be deployed to further reduce readmission risk.
While the dataset was comprehensive, certain additions or changes could help us construct a more robust analysis. Given the complexity of the hospital readmission problem and our logistic regression model’s roughly 60% accuracy, it is likely that variables not collected in this dataset are significant in predicting whether a patient gets readmitted. Factors like socio-economic status, level of social support (living alone or not), and geographical location all have the potential to affect hospital readmission, so future studies should try to collect more features and look into factors not discussed in our paper. Another potential improvement is to perform a grid search when assigning class weights to the logistic regression model. In our analysis, we balanced the dataset by assigning roughly ten times more weight to the smaller (readmitted) class than to the larger class. There are many other ways to balance an unbalanced dataset, so future studies could use a grid search to determine the weight assignment that reaches a better overall accuracy and F-score. Last but not least, adding interaction terms is also a good starting point for future research. In our analysis, we performed a preliminary LASSO regression using all the variables and their interaction terms, but we ended up with more than 800 significant predictor variables. We did not continue modeling with interaction terms, but more modeling techniques could be applied to reduce the number of significant predictors and reach better prediction performance.
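As a concrete starting point for the grid-search idea, one could tune the class-weight ratio against the F-score along the following lines; this is a sketch that reuses the encoded predictors and labels from the earlier sketches, and the candidate weights are arbitrary.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Candidate weights for the readmitted ("Yes") class, with the "No" class fixed at 1.
param_grid = {"class_weight": [{"No": 1, "Yes": w} for w in (2, 5, 8, 10, 12, 15)]}

search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000),
    param_grid,
    scoring=make_scorer(f1_score, pos_label="Yes"),  # optimize the F-score, not raw accuracy
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```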