library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.3.5     v purrr   0.3.2
## v tibble  2.1.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'stringr' was built under R version 3.6.3
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(ggplot2)
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Introduction to Data

The data set describes hospital encounters of patients with diabetes and whether each patient was readmitted. It includes about 50 variables covering the characteristics of readmitted and non-readmitted patients. Most variables describe patient characteristics, medical conditions, or features of medical treatment; others provide measures of quality and condition. The data types are varied and include discrete, continuous, and categorical (both nominal and ordinal) data.

Data Source

The data was originally published by the UCI Machine Learning Repository. There are 101,766 observations and 50 columns. No value is stored as NA, but missing entries are coded as “?” in several columns (for example race, weight, payer_code, and medical_specialty). The data characteristics, statistics, and description are discussed below.
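
Because the summary output later in this report shows missing entries coded as “?”, a plain NA check would not find them. The sketch below counts a placeholder marker per column; the toy data frame and the helper name `count_marker` are made up for illustration and are not part of the original analysis.

```r
# Toy data frame standing in for a few columns of the diabetes data (values are made up).
toy <- data.frame(
  race   = c("Caucasian", "?", "AfricanAmerican", "?"),
  weight = c("?", "?", "[75-100)", "?"),
  age    = c("[0-10)", "[10-20)", "[20-30)", "[30-40)"),
  stringsAsFactors = TRUE
)

# Count how often a placeholder marker appears in each column.
# `count_marker` is a hypothetical helper, not a function from any package.
count_marker <- function(df, marker = "?") {
  vapply(df, function(col) sum(as.character(col) == marker, na.rm = TRUE), integer(1))
}

marker_counts <- count_marker(toy)
print(marker_counts)
```

An alternative would be to pass `na.strings = "?"` to `read.csv()` so that these entries are read as NA from the start; the report itself does not do this.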

Hypothesis

This research addresses two questions: 1- Can readmission within 30 days be avoided by estimating how much impact each feature has on the readmission of diabetic patients? 2- Which of several prediction models predicts, with good accuracy, whether a patient will be readmitted to the hospital?

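# Read the data; stringsAsFactors = TRUE keeps text columns as factors on R >= 4.0,
# which the factor summaries and droplevels() calls below rely on
raw_data <- read.csv("diabetic_data.csv", stringsAsFactors = TRUE)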
kable(head(raw_data))
encounter_id patient_nbr race gender age weight admission_type_id discharge_disposition_id admission_source_id time_in_hospital payer_code medical_specialty num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone acarbose miglitol troglitazone tolazamide examide citoglipton insulin glyburide.metformin glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
2278392 8222157 Caucasian Female [0-10) ? 6 25 1 1 ? Pediatrics-Endocrinology 41 0 1 0 0 0 250.83 ? ? 1 None None No No No No No No No No No No No No No No No No No No No No No No No No No NO
149190 55629189 Caucasian Female [10-20) ? 1 1 7 3 ? ? 59 0 18 0 0 0 276 250.01 255 9 None None No No No No No No No No No No No No No No No No No Up No No No No No Ch Yes >30
64410 86047875 AfricanAmerican Female [20-30) ? 1 1 7 2 ? ? 11 5 13 2 0 1 648 250 V27 6 None None No No No No No No Steady No No No No No No No No No No No No No No No No No Yes NO
500364 82442376 Caucasian Male [30-40) ? 1 1 7 2 ? ? 44 1 16 0 0 0 8 250.43 403 7 None None No No No No No No No No No No No No No No No No No Up No No No No No Ch Yes NO
16680 42519267 Caucasian Male [40-50) ? 1 1 7 1 ? ? 51 0 8 0 0 0 197 157 250 5 None None No No No No No No Steady No No No No No No No No No No Steady No No No No No Ch Yes NO
35754 82637451 Caucasian Male [50-60) ? 2 1 2 3 ? ? 31 6 16 0 0 0 414 411 250 9 None None No No No No No No No No No No No No No No No No No Steady No No No No No No Yes >30
dim(raw_data)
## [1] 101766     50

Data Description and Type:

The diabetes data set consists of 101,766 records and 50 features. Ten of the important features are described in detail below.

  1. Encounter ID: TYPE: Numeric (identifier). DESC: A unique identifier for each claim record.

  2. Race: TYPE: Categorical. DESC: African-American, Asian, Caucasian, Hispanic, Other.

  3. Gender: TYPE: Categorical. DESC: Male, Female, Unknown/Invalid.

  4. Age: TYPE: Categorical. DESC: [0-10), [10-20), [20-30), [30-40), [40-50), [50-60), [60-70), [70-80), [80-90), [90-100).

  5. Weight: TYPE: Categorical. DESC: Patient weight, recorded in ranges such as [75-100); missing (“?”) for 98,569 records.

  6. Admission type ID: TYPE: Categorical. DESC: Elective, Emergency, Newborn, Not Available, Not Mapped, Urgent.

  7. Discharge disposition ID: TYPE: Categorical. DESC: Discharge status, for example “Admitted as an inpatient to this hospital”.

  8. Admission source ID: TYPE: Categorical. DESC: Clinic Referral, Court/Law Enforcement, Emergency Room, HMO Referral.

  9. Time in hospital: TYPE: Numerical. DESC: Number of days the patient stayed in the hospital.

  10. Readmitted: TYPE: Categorical. DESC: NO, <30, >30.

Summary Statistics and Description

The current data set is composed of 101766 records and 50 features.

summary(raw_data)
##   encounter_id        patient_nbr                     race      
##  Min.   :    12522   Min.   :      135   ?              : 2273  
##  1st Qu.: 84961194   1st Qu.: 23413221   AfricanAmerican:19210  
##  Median :152388987   Median : 45505143   Asian          :  641  
##  Mean   :165201646   Mean   : 54330401   Caucasian      :76099  
##  3rd Qu.:230270888   3rd Qu.: 87545950   Hispanic       : 2037  
##  Max.   :443867222   Max.   :189502619   Other          : 1506  
##                                                                 
##              gender           age              weight      admission_type_id
##  Female         :54708   [70-80):26068   ?        :98569   Min.   :1.000    
##  Male           :47055   [60-70):22483   [75-100) : 1336   1st Qu.:1.000    
##  Unknown/Invalid:    3   [50-60):17256   [50-75)  :  897   Median :1.000    
##                          [80-90):17197   [100-125):  625   Mean   :2.024    
##                          [40-50): 9685   [125-150):  145   3rd Qu.:3.000    
##                          [30-40): 3775   [25-50)  :   97   Max.   :8.000    
##                          (Other): 5302   (Other)  :   97                    
##  discharge_disposition_id admission_source_id time_in_hospital   payer_code   
##  Min.   : 1.000           Min.   : 1.000      Min.   : 1.000   ?      :40256  
##  1st Qu.: 1.000           1st Qu.: 1.000      1st Qu.: 2.000   MC     :32439  
##  Median : 1.000           Median : 7.000      Median : 4.000   HM     : 6274  
##  Mean   : 3.716           Mean   : 5.754      Mean   : 4.396   SP     : 5007  
##  3rd Qu.: 4.000           3rd Qu.: 7.000      3rd Qu.: 6.000   BC     : 4655  
##  Max.   :28.000           Max.   :25.000      Max.   :14.000   MD     : 3532  
##                                                                (Other): 9603  
##               medical_specialty num_lab_procedures num_procedures
##  ?                     :49949   Min.   :  1.0      Min.   :0.00  
##  InternalMedicine      :14635   1st Qu.: 31.0      1st Qu.:0.00  
##  Emergency/Trauma      : 7565   Median : 44.0      Median :1.00  
##  Family/GeneralPractice: 7440   Mean   : 43.1      Mean   :1.34  
##  Cardiology            : 5352   3rd Qu.: 57.0      3rd Qu.:2.00  
##  Surgery-General       : 3099   Max.   :132.0      Max.   :6.00  
##  (Other)               :13726                                    
##  num_medications number_outpatient number_emergency  number_inpatient 
##  Min.   : 1.00   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.:10.00   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median :15.00   Median : 0.0000   Median : 0.0000   Median : 0.0000  
##  Mean   :16.02   Mean   : 0.3694   Mean   : 0.1978   Mean   : 0.6356  
##  3rd Qu.:20.00   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 1.0000  
##  Max.   :81.00   Max.   :42.0000   Max.   :76.0000   Max.   :21.0000  
##                                                                       
##      diag_1          diag_2          diag_3      number_diagnoses max_glu_serum
##  428    : 6862   276    : 6752   250    :11555   Min.   : 1.000   >200: 1485   
##  414    : 6581   428    : 6662   401    : 8289   1st Qu.: 6.000   >300: 1264   
##  786    : 4016   250    : 6071   276    : 5175   Median : 8.000   None:96420   
##  410    : 3614   427    : 5036   428    : 4577   Mean   : 7.423   Norm: 2597   
##  486    : 3508   401    : 3736   427    : 3955   3rd Qu.: 9.000                
##  427    : 2766   496    : 3305   414    : 3664   Max.   :16.000                
##  (Other):74419   (Other):70204   (Other):64551                                 
##  A1Cresult     metformin     repaglinide     nateglinide     chlorpropamide 
##  >7  : 3812   Down  :  575   Down  :    45   Down  :    11   Down  :     1  
##  >8  : 8216   No    :81778   No    :100227   No    :101063   No    :101680  
##  None:84748   Steady:18346   Steady:  1384   Steady:   668   Steady:    79  
##  Norm: 4990   Up    : 1067   Up    :   110   Up    :    24   Up    :     6  
##                                                                             
##                                                                             
##                                                                             
##  glimepiride    acetohexamide    glipizide      glyburide     tolbutamide    
##  Down  :  194   No    :101765   Down  :  560   Down  :  564   No    :101743  
##  No    :96575   Steady:     1   No    :89080   No    :91116   Steady:    23  
##  Steady: 4670                   Steady:11356   Steady: 9274                  
##  Up    :  327                   Up    :  770   Up    :  812                  
##                                                                              
##                                                                              
##                                                                              
##  pioglitazone   rosiglitazone    acarbose        miglitol      troglitazone   
##  Down  :  118   Down  :   87   Down  :     3   Down  :     5   No    :101763  
##  No    :94438   No    :95401   No    :101458   No    :101728   Steady:     3  
##  Steady: 6976   Steady: 6100   Steady:   295   Steady:    31                  
##  Up    :  234   Up    :  178   Up    :    10   Up    :     2                  
##                                                                               
##                                                                               
##                                                                               
##   tolazamide     examide     citoglipton   insulin      glyburide.metformin
##  No    :101727   No:101766   No:101766   Down  :12218   Down  :     6      
##  Steady:    38                           No    :47383   No    :101060      
##  Up    :     1                           Steady:30849   Steady:   692      
##                                          Up    :11316   Up    :     8      
##                                                                            
##                                                                            
##                                                                            
##  glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone
##  No    :101753       No    :101765            No    :101764          
##  Steady:    13       Steady:     1            Steady:     2          
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##  metformin.pioglitazone change     diabetesMed readmitted 
##  No    :101765          Ch:47011   No :23403   <30:11357  
##  Steady:     1          No:54755   Yes:78363   >30:35545  
##                                                NO :54864  
##                                                           
##                                                           
##                                                           
## 
summary(raw_data$gender)
##          Female            Male Unknown/Invalid 
##           54708           47055               3
describe(raw_data)
##                           vars      n         mean           sd    median
## encounter_id                 1 101766 165201645.62 102640295.98 152388987
## patient_nbr                  2 101766  54330400.69  38696359.35  45505143
## race*                        3 101766         3.60         0.94         4
## gender*                      4 101766         1.46         0.50         1
## age*                         5 101766         7.10         1.59         7
## weight*                      6 101766         1.19         1.15         1
## admission_type_id            7 101766         2.02         1.45         1
## discharge_disposition_id     8 101766         3.72         5.28         1
## admission_source_id          9 101766         5.75         4.06         7
## time_in_hospital            10 101766         4.40         2.99         4
## payer_code*                 11 101766         5.89         4.82         8
## medical_specialty*          12 101766        12.71        17.51         5
## num_lab_procedures          13 101766        43.10        19.67        44
## num_procedures              14 101766         1.34         1.71         1
## num_medications             15 101766        16.02         8.13        15
## number_outpatient           16 101766         0.37         1.27         0
## number_emergency            17 101766         0.20         0.93         0
## number_inpatient            18 101766         0.64         1.26         0
## diag_1*                     19 101766       338.70       160.60       300
## diag_2*                     20 101766       277.48       153.51       262
## diag_3*                     21 101766       278.90       177.75       257
## number_diagnoses            22 101766         7.42         1.93         8
## max_glu_serum*              23 101766         2.98         0.31         3
## A1Cresult*                  24 101766         2.89         0.52         3
## metformin*                  25 101766         2.20         0.44         2
## repaglinide*                26 101766         2.02         0.13         2
## nateglinide*                27 101766         2.01         0.09         2
## chlorpropamide*             28 101766         2.00         0.03         2
## glimepiride*                29 101766         2.05         0.24         2
## acetohexamide*              30 101766         1.00         0.00         1
## glipizide*                  31 101766         2.12         0.36         2
## glyburide*                  32 101766         2.10         0.34         2
## tolbutamide*                33 101766         1.00         0.02         1
## pioglitazone*               34 101766         2.07         0.27         2
## rosiglitazone*              35 101766         2.06         0.25         2
## acarbose*                   36 101766         2.00         0.06         2
## miglitol*                   37 101766         2.00         0.02         2
## troglitazone*               38 101766         1.00         0.01         1
## tolazamide*                 39 101766         1.00         0.02         1
## examide*                    40 101766         1.00         0.00         1
## citoglipton*                41 101766         1.00         0.00         1
## insulin*                    42 101766         2.41         0.84         2
## glyburide.metformin*        43 101766         2.01         0.08         2
## glipizide.metformin*        44 101766         1.00         0.01         1
## glimepiride.pioglitazone*   45 101766         1.00         0.00         1
## metformin.rosiglitazone*    46 101766         1.00         0.00         1
## metformin.pioglitazone*     47 101766         1.00         0.00         1
## change*                     48 101766         1.54         0.50         2
## diabetesMed*                49 101766         1.77         0.42         2
## readmitted*                 50 101766         2.43         0.68         3
##                                trimmed          mad   min       max     range
## encounter_id              156080797.87 105147686.61 12522 443867222 443854700
## patient_nbr                52476125.15  48851868.67   135 189502619 189502484
## race*                             3.71         0.00     1         6         5
## gender*                           1.45         0.00     1         3         2
## age*                              7.21         1.48     1        10         9
## weight*                           1.00         0.00     1        10         9
## admission_type_id                 1.70         0.00     1         8         7
## discharge_disposition_id          2.28         0.00     1        28        27
## admission_source_id               5.33         0.00     1        25        24
## time_in_hospital                  3.99         2.97     1        14        13
## payer_code*                       5.33         8.90     1        18        17
## medical_specialty*                8.49         5.93     1        73        72
## num_lab_procedures               43.79        19.27     1       132       131
## num_procedures                    1.02         1.48     0         6         6
## num_medications                  15.23         7.41     1        81        80
## number_outpatient                 0.08         0.00     0        42        42
## number_emergency                  0.01         0.00     0        76        76
## number_inpatient                  0.35         0.00     0        21        21
## diag_1*                         332.73       146.78     1       717       716
## diag_2*                         264.90       174.95     1       749       748
## diag_3*                         257.05       171.98     1       790       789
## number_diagnoses                  7.71         1.48     1        16        15
## max_glu_serum*                    3.00         0.00     1         4         3
## A1Cresult*                        2.98         0.00     1         4         3
## metformin*                        2.11         0.00     1         4         3
## repaglinide*                      2.00         0.00     1         4         3
## nateglinide*                      2.00         0.00     1         4         3
## chlorpropamide*                   2.00         0.00     1         4         3
## glimepiride*                      2.00         0.00     1         4         3
## acetohexamide*                    1.00         0.00     1         2         1
## glipizide*                        2.02         0.00     1         4         3
## glyburide*                        2.00         0.00     1         4         3
## tolbutamide*                      1.00         0.00     1         2         1
## pioglitazone*                     2.00         0.00     1         4         3
## rosiglitazone*                    2.00         0.00     1         4         3
## acarbose*                         2.00         0.00     1         4         3
## miglitol*                         2.00         0.00     1         4         3
## troglitazone*                     1.00         0.00     1         2         1
## tolazamide*                       1.00         0.00     1         3         2
## examide*                          1.00         0.00     1         1         0
## citoglipton*                      1.00         0.00     1         1         0
## insulin*                          2.38         1.48     1         4         3
## glyburide.metformin*              2.00         0.00     1         4         3
## glipizide.metformin*              1.00         0.00     1         2         1
## glimepiride.pioglitazone*         1.00         0.00     1         2         1
## metformin.rosiglitazone*          1.00         0.00     1         2         1
## metformin.pioglitazone*           1.00         0.00     1         2         1
## change*                           1.55         0.00     1         2         1
## diabetesMed*                      1.84         0.00     1         2         1
## readmitted*                       2.53         0.00     1         3         2
##                             skew  kurtosis        se
## encounter_id                0.70     -0.10 321748.51
## patient_nbr                 0.47     -0.35 121302.22
## race*                      -1.04      0.66      0.00
## gender*                     0.15     -1.98      0.00
## age*                       -0.63      0.28      0.00
## weight*                     6.17     36.88      0.00
## admission_type_id           1.59      1.94      0.00
## discharge_disposition_id    2.56      6.00      0.02
## admission_source_id         1.03      1.74      0.01
## time_in_hospital            1.13      0.85      0.01
## payer_code*                 0.50     -0.72      0.02
## medical_specialty*          1.89      2.95      0.05
## num_lab_procedures         -0.24     -0.25      0.06
## num_procedures              1.32      0.86      0.01
## num_medications             1.33      3.47      0.03
## number_outpatient           8.83    147.90      0.00
## number_emergency           22.85   1191.60      0.00
## number_inpatient            3.61     20.72      0.00
## diag_1*                     0.38     -0.34      0.50
## diag_2*                     0.72      0.41      0.48
## diag_3*                     1.01      0.69      0.56
## number_diagnoses           -0.88     -0.08      0.01
## max_glu_serum*             -3.33     25.71      0.00
## A1Cresult*                 -1.76      5.43      0.00
## metformin*                  1.69      2.54      0.00
## repaglinide*                8.59     88.34      0.00
## nateglinide*               12.43    175.40      0.00
## chlorpropamide*            37.86   1651.27      0.00
## glimepiride*                4.34     22.49      0.00
## acetohexamide*            319.00 101759.00      0.00
## glipizide*                  2.41      6.64      0.00
## glyburide*                  2.76      9.24      0.00
## tolbutamide*               66.49   4419.52      0.00
## pioglitazone*               3.47     12.49      0.00
## rosiglitazone*              3.77     14.65      0.00
## acarbose*                  19.02    403.22      0.00
## miglitol*                  45.88   3570.19      0.00
## troglitazone*             184.17  33916.33      0.00
## tolazamide*                53.88   3110.38      0.00
## examide*                     NaN       NaN      0.00
## citoglipton*                 NaN       NaN      0.00
## insulin*                    0.25     -0.50      0.00
## glyburide.metformin*       12.01    152.88      0.00
## glipizide.metformin*       88.46   7823.00      0.00
## glimepiride.pioglitazone* 319.00 101759.00      0.00
## metformin.rosiglitazone*  225.56  50877.00      0.00
## metformin.pioglitazone*   319.00 101759.00      0.00
## change*                    -0.15     -1.98      0.00
## diabetesMed*               -1.28     -0.35      0.00
## readmitted*                -0.78     -0.57      0.00

Data Analysis

Cleaning

I dropped variables that are unnecessary, uninformative, or have too many missing values: “encounter_id”, “patient_nbr”, “weight”, “payer_code”, “medical_specialty”, “diabetesMed”, “diag_2”, “diag_3”, and “diag_1”. The first two are identifiers, while “weight”, “payer_code”, and “medical_specialty” have a large share of missing values. In addition to dropping columns, I removed every patient whose discharge status is “expired”. I also removed the records with unknown gender because the missing values in the “gender” column cannot be imputed. Finally, I dropped columns whose values barely change (near-zero variance).

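# Columns to drop: identifiers, sparsely populated columns, and diagnosis codes
drops <- c("encounter_id", "patient_nbr", "weight", "payer_code",
           "medical_specialty", "diabetesMed", "diag_2", "diag_3", "diag_1")
raw_data <- raw_data[ , !(names(raw_data) %in% drops)]
# Remove encounters where the patient expired (discharge codes 11, 19, 20, 21)
raw_data <- filter(raw_data, !(discharge_disposition_id %in% c(11, 19, 20, 21)))
# Gender: remove the few "Unknown/Invalid" records
raw_data <- raw_data %>% filter(gender != "Unknown/Invalid")
raw_data$gender <- droplevels(raw_data$gender)
# Race: missing values are coded as "?" in this file, so filtering on "NA" removes nothing
raw_data <- raw_data %>% filter(race != "?")
raw_data$race <- droplevels(raw_data$race)
# Drop near-zero-variance columns; the guard avoids dropping every column
# when nearZeroVar() returns an empty result
avoid_features <- nearZeroVar(raw_data)
if (length(avoid_features) > 0) raw_data <- raw_data[, -avoid_features]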
kable(head(raw_data))
race gender age admission_type_id discharge_disposition_id admission_source_id time_in_hospital num_lab_procedures num_procedures num_medications number_outpatient number_emergency number_inpatient number_diagnoses A1Cresult metformin glipizide glyburide pioglitazone rosiglitazone insulin change readmitted
Caucasian Female [0-10) 6 25 1 1 41 0 1 0 0 0 1 None No No No No No No No NO
Caucasian Female [10-20) 1 1 7 3 59 0 18 0 0 0 9 None No No No No No Up Ch >30
AfricanAmerican Female [20-30) 1 1 7 2 11 5 13 2 0 1 6 None No Steady No No No No No NO
Caucasian Male [30-40) 1 1 7 2 44 1 16 0 0 0 7 None No No No No No Up Ch NO
Caucasian Male [40-50) 1 1 7 1 51 0 8 0 0 0 5 None No Steady No No No Steady Ch NO
Caucasian Male [50-60) 2 1 2 3 31 6 16 0 0 0 9 None No No No No No Steady No >30
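
To make the near-zero-variance step more concrete, here is a minimal base R sketch of the idea behind `nearZeroVar()`. The helper `is_near_zero_var` is hypothetical, and its thresholds are assumed to match caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`); it is an approximation of the rule, not caret's implementation. The level counts are copied from the `summary()` output above.

```r
# Hypothetical helper that mimics the idea behind caret::nearZeroVar().
# Thresholds are assumed to match caret's defaults; check ?nearZeroVar for your version.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)          # a single value: zero variance
  freq_ratio     <- counts[[1]] / counts[[2]]   # most common vs. second most common value
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique < unique_cut
}

# Level counts copied from the summary() output above.
acetohexamide <- rep(c("No", "Steady"), times = c(101765, 1))
insulin <- rep(c("Down", "No", "Steady", "Up"), times = c(12218, 47383, 30849, 11316))

nzv_flags <- c(
  acetohexamide = is_near_zero_var(acetohexamide),
  insulin       = is_near_zero_var(insulin)
)
print(nzv_flags)
```

Under these assumptions acetohexamide is flagged and insulin is not, which agrees with the columns that remain in the table above.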

Encoding

The variables were encoded as follows:

  • Age : replaced by the midpoint of its range. For example, [0-10) = 5, [10-20) = 15, etc.

  • Medication Change : no change = 0, change = 1

  • Gender : Female = 0, Male = 1

  • Race : Caucasian = 0, African American = 1, Other = 2

  • Insulin dosage : no insulin = 0; decrease in insulin = -1; steady insulin = 1; increase in insulin = 2

  • rosiglitazone : No = 0; any other value (Steady, Up, Down) = 1

  • pioglitazone : No = 0; any other value (Steady, Up, Down) = 1

  • glyburide : No = 0; any other value (Steady, Up, Down) = 1

  • glipizide : No = 0; any other value (Steady, Up, Down) = 1

  • metformin : No = 0; any other value (Steady, Up, Down) = 1

  • A1C results : None = 0, Normal = 1, abnormal (>7 or >8) = 2

  • Target (readmitted) : no readmission within 30 days (“NO” or “>30”) = 0; readmission in <30 days = 1

# Age: replace each 10-year interval with its midpoint
age_midpoints <- c("[0-10)" = 5, "[10-20)" = 15, "[20-30)" = 25, "[30-40)" = 35,
                   "[40-50)" = 45, "[50-60)" = 55, "[60-70)" = 65, "[70-80)" = 75,
                   "[80-90)" = 85, "[90-100)" = 95)
raw_data$age <- unname(age_midpoints[as.character(raw_data$age)])

# Medication change: no change == 0; change == 1
raw_data$change <- ifelse(raw_data$change == "Ch", 1, 0)

# Gender: Female == 0; Male == 1
raw_data$gender <- ifelse(raw_data$gender == "Female", 0, 1)

# Race: Caucasian == 0; African American == 1; Other == 2
raw_data$race <- ifelse(raw_data$race == "Caucasian", 0, ifelse(raw_data$race == "AfricanAmerican", 1, 2))

# Insulin dosage: no insulin == 0; decrease == -1; steady == 1; increase == 2
raw_data$insulin <- ifelse(raw_data$insulin == "No", 0, ifelse(raw_data$insulin == "Down", -1, ifelse(raw_data$insulin == "Steady", 1, 2)))

# Other medications: "No" == 0; any other value (Steady, Up, Down) == 1
raw_data$rosiglitazone <- ifelse(raw_data$rosiglitazone == "No", 0, 1)
raw_data$pioglitazone <- ifelse(raw_data$pioglitazone == "No", 0, 1)
raw_data$glyburide <- ifelse(raw_data$glyburide == "No", 0, 1)
raw_data$glipizide <- ifelse(raw_data$glipizide == "No", 0, 1)
raw_data$metformin <- ifelse(raw_data$metformin == "No", 0, 1)

# A1C results: None == 0; Normal == 1; abnormal (>7 or >8) == 2
raw_data$A1Cresult <- ifelse(raw_data$A1Cresult == "None", 0, 
                         ifelse(raw_data$A1Cresult == "Norm", 1, 
                                ifelse(raw_data$A1Cresult %in% c(">7", ">8"), 2, NA)))

# Target "readmitted": no readmission within 30 days ("NO" or ">30") == 0; "<30" == 1
raw_data$readmitted <- ifelse((raw_data$readmitted == "NO" | raw_data$readmitted == ">30"), 0, 1)
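
The binary medication recodes above send every value other than “No” to 1, so “Up” and “Down” are grouped with “Steady”. A small cross-tabulation makes this visible; the vector below is made-up example data, not taken from the file.

```r
# Made-up medication values covering all four levels seen in the data.
metformin <- factor(c("No", "Steady", "Up", "Down", "No"))

# Same recode pattern as in the report: "No" becomes 0, everything else becomes 1.
metformin_binary <- ifelse(metformin == "No", 0, 1)

# Cross-tabulate the original level against the encoded value.
recode_check <- table(original = metformin, encoded = metformin_binary)
print(recode_check)
```

Running the same kind of `table()` call on each recoded column of the real data, before overwriting it, would be one way to confirm that no level ends up somewhere unexpected.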

Visualization

Target Variable (readmitted)

The target variable, readmitted, is plotted below. After encoding it is a binary variable with no missing values. The classes are imbalanced: not-readmitted encounters (0) far outnumber readmitted ones (1). In the raw summary above, 11,357 of the 101,766 encounters (about 11%) are labelled “<30”.

raw_data %>%
  ggplot(aes(readmitted)) + 
  geom_histogram(bins = 30) +
  theme_bw() +
  # 'center' is not a documented legend position; the plot has no legend, so hide it
  theme(legend.position = 'none') +
  labs(y = 'Count', title = 'Readmitted Histogram')
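
The class balance can also be read off numerically. The sketch below collapses the three-level counts from the `summary()` output above into the binary target and computes the shares. These are counts before the cleaning filters, so the proportions in the cleaned data will differ somewhat.

```r
# Counts of the original three-level outcome, copied from the summary() output above.
readmitted_counts <- c("<30" = 11357, ">30" = 35545, "NO" = 54864)

# Collapse to the binary target used in this report: 1 = readmitted within 30 days.
binary_counts <- c(
  "0" = sum(readmitted_counts[c(">30", "NO")]),
  "1" = readmitted_counts[["<30"]]
)

# Share of each class among all encounters.
class_share <- prop.table(binary_counts)
print(round(class_share, 3))
```

With roughly one positive case in nine, accuracy alone can look good for a model that rarely predicts readmission, which is worth keeping in mind when comparing models.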

Predictors

raw_data[,-c(1)]  %>%
  gather(Variable, Values) %>%
  ggplot(aes(x = Values)) +
  geom_histogram(alpha = 0.2, col = "black", bins = 15) +
  facet_wrap(~ Variable, scales = "free", nrow = 6)

  • The histograms of the individual predictors show that, besides the continuous numerical variables, many predictors are discrete, such as number_inpatient and number_emergency.

  • The obviously discrete variables, each with at most 14 distinct values, are:

    • age,

    • admission_type_id,

    • time_in_hospital

  • num_medications and num_lab_procedures are also integer counts, but with much wider ranges (up to 81 and 132 in the summary above).

  • One variable appears bimodal:

    • number_diagnoses

  • Age is left-skewed (skew of -0.63 in the describe() output above), with counts peaking in the 70s. time_in_hospital is right-skewed (skew of 1.13), with most stays being short (median of 4 days).

  • We chose bins = 15 and facet_wrap for the histograms. These findings hold after changing the number of bins.
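
As a reminder of how the sign of the skew relates to the shape of a histogram, here is a minimal sketch with made-up numbers. The helper `moment_skew` is hypothetical and uses the basic moment formula; `describe()` from psych may apply a slightly different small-sample correction, so the decimals need not match its output.

```r
# Basic moment-based sample skewness: third central moment over variance^1.5.
# `moment_skew` is a hypothetical helper, not a function from psych.
moment_skew <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

short_stays <- c(1, 1, 2, 2, 3, 3, 4, 14)        # made-up: long tail to the right
older_ages  <- c(5, 55, 65, 75, 75, 75, 85, 85)  # made-up: long tail to the left

skews <- c(right_tail = moment_skew(short_stays), left_tail = moment_skew(older_ages))
print(round(skews, 2))
```

A positive value means the long tail points toward larger values (right-skewed); a negative value means it points toward smaller values (left-skewed).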

Outliers Analysis with Boxplot

raw_data[,-c(1)] %>% 
  gather(Variable, Values) %>% 
  ggplot(aes( y = Values)) +
  geom_boxplot() +
  facet_wrap(~ Variable, scales = "free", nrow = 4) +
    theme(panel.background = element_rect(fill = 'white'),
        axis.text.x = element_text(size = 10, angle = 90)) 

Because some of the variables are skewed, the box plots mark many observations of these predictors as outliers. These variables include:

  • A1Cresult,

  • number_emergency,

  • num_medications,

  • time_in_hospital,

  • num_lab_procedures
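
The usual box plot convention flags points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to made-up counts that are mostly zero, similar in shape to number_emergency (whose third quartile is 0 in the summary above). The helper `flag_boxplot_outliers` is hypothetical, and the exact quartile method used by a given plotting function may differ slightly.

```r
# Flag points outside the 1.5 * IQR fences, the usual box plot convention.
# `flag_boxplot_outliers` is a hypothetical helper, not a ggplot2 function.
flag_boxplot_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}

# Made-up visit counts: mostly zero with a few larger values.
visits <- c(rep(0, 90), rep(1, 8), 5, 20)
n_flagged <- sum(flag_boxplot_outliers(visits))
print(n_flagged)
```

When both quartiles are 0 the interquartile range is 0, so every non-zero count is flagged. In such variables the flagged points reflect the skewed shape of the distribution and are not necessarily data errors.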