library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.3.5 v purrr 0.3.2
## v tibble 2.1.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'stringr' was built under R version 3.6.3
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ggplot2)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
The data set describes hospital encounters of diabetic patients and whether each patient was readmitted. It includes about 50 variables that cover characteristics of readmitted and not-readmitted patients. Most variables describe patient characteristics, medical conditions, or features of medical treatment, while others provide measures of quality and condition. The data types are varied and include discrete, continuous, and categorical (both nominal and ordinal) data.
The data was originally published by the UCI Machine Learning Repository. There are 101766 observations in 50 columns. No column contains an explicit NA, but missing values are recorded as "?" in several columns (for example weight, payer_code and medical_specialty). I describe the data characteristics and summary statistics below.
The focus of this research is to use the data to address two questions: 1- Can we help avoid 30-day readmissions by estimating how much impact each feature has on the readmission of diabetic patients? 2- Which of several prediction models predicts with good accuracy whether a patient will be readmitted to the hospital?
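# Read the raw data. Strings are kept as factors so that summary() and
# droplevels() below behave as shown (the read.csv default changed in R 4.0).
raw_data <- read.csv("diabetic_data.csv", stringsAsFactors = TRUE)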
kable(head(raw_data))
encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | payer_code | medical_specialty | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | diag_1 | diag_2 | diag_3 | number_diagnoses | max_glu_serum | A1Cresult | metformin | repaglinide | nateglinide | chlorpropamide | glimepiride | acetohexamide | glipizide | glyburide | tolbutamide | pioglitazone | rosiglitazone | acarbose | miglitol | troglitazone | tolazamide | examide | citoglipton | insulin | glyburide.metformin | glipizide.metformin | glimepiride.pioglitazone | metformin.rosiglitazone | metformin.pioglitazone | change | diabetesMed | readmitted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2278392 | 8222157 | Caucasian | Female | [0-10) | ? | 6 | 25 | 1 | 1 | ? | Pediatrics-Endocrinology | 41 | 0 | 1 | 0 | 0 | 0 | 250.83 | ? | ? | 1 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | NO |
149190 | 55629189 | Caucasian | Female | [10-20) | ? | 1 | 1 | 7 | 3 | ? | ? | 59 | 0 | 18 | 0 | 0 | 0 | 276 | 250.01 | 255 | 9 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Up | No | No | No | No | No | Ch | Yes | >30 |
64410 | 86047875 | AfricanAmerican | Female | [20-30) | ? | 1 | 1 | 7 | 2 | ? | ? | 11 | 5 | 13 | 2 | 0 | 1 | 648 | 250 | V27 | 6 | None | None | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes | NO |
500364 | 82442376 | Caucasian | Male | [30-40) | ? | 1 | 1 | 7 | 2 | ? | ? | 44 | 1 | 16 | 0 | 0 | 0 | 8 | 250.43 | 403 | 7 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Up | No | No | No | No | No | Ch | Yes | NO |
16680 | 42519267 | Caucasian | Male | [40-50) | ? | 1 | 1 | 7 | 1 | ? | ? | 51 | 0 | 8 | 0 | 0 | 0 | 197 | 157 | 250 | 5 | None | None | No | No | No | No | No | No | Steady | No | No | No | No | No | No | No | No | No | No | Steady | No | No | No | No | No | Ch | Yes | NO |
35754 | 82637451 | Caucasian | Male | [50-60) | ? | 2 | 1 | 2 | 3 | ? | ? | 31 | 6 | 16 | 0 | 0 | 0 | 414 | 411 | 250 | 9 | None | None | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Steady | No | No | No | No | No | No | Yes | >30 |
dim(raw_data)
## [1] 101766 50
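Missing values in this file appear as the literal string "?" rather than NA (see the weight and payer_code columns in the table above), so is.na() will not find them. A minimal sketch of a per-column count is shown here. The helper name count_placeholders() and the toy data frame are assumptions for illustration only. In the analysis the function would be applied to raw_data.

```r
# Hypothetical helper: count how often a placeholder string occurs in each column.
count_placeholders <- function(df, placeholder = "?") {
  vapply(
    df,
    function(col) sum(as.character(col) == placeholder, na.rm = TRUE),
    integer(1)
  )
}

# Toy data frame standing in for raw_data (values are illustrative only).
toy <- data.frame(
  race             = c("Caucasian", "?", "Asian"),
  weight           = c("?", "?", "[75-100)"),
  time_in_hospital = c(1, 3, 2),
  stringsAsFactors = TRUE
)

placeholder_counts <- count_placeholders(toy)
print(placeholder_counts)
```

Calling count_placeholders(raw_data) right after reading the file would give the same "?" counts that summary() reports for the factor columns.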
The diabetes dataset used here consists of 101766 records and 50 features. Below is a description of 10 of the important features in the dataset.
Encounter ID: TYPE : Numeric identifier. DESC : a unique identifier for each encounter record.
Race: TYPE : Categorical. DESC : AfricanAmerican, Asian, Caucasian, Hispanic, Other; missing values are coded as "?".
Gender: TYPE : Categorical. DESC : Male, Female, Unknown/Invalid.
Age: TYPE : Categorical. DESC: [0-10), [10-20), [20-30), [30-40), [40-50), [50-60), [60-70), [70-80), [80-90), [90-100).
Weight: TYPE : Categorical (binned). DESC: weight of the patient in intervals such as [50-75) and [75-100); coded as "?" for 98569 records.
Admission type ID: TYPE : Categorical (integer code, 1 to 8). DESC: for example Elective, Emergency, Newborn, Not Available, Not Mapped, Urgent.
Discharge disposition ID: TYPE : Categorical (integer code, 1 to 28). DESC: the patient's status at discharge, for example admitted as an inpatient to this hospital.
Admission source ID: TYPE : Categorical (integer code, 1 to 25). DESC: for example Clinic Referral, Court/Law Enforcement, Emergency Room, HMO Referral.
Time in hospital: TYPE : Numerical. DESC: number of days the patient stayed in the hospital (1 to 14).
Readmitted: TYPE: Categorical. DESC : "<30" (readmitted within 30 days), ">30" (readmitted after more than 30 days), "NO" (no readmission).
The current data set is composed of 101766 records and 50 features.
summary(raw_data)
## encounter_id patient_nbr race
## Min. : 12522 Min. : 135 ? : 2273
## 1st Qu.: 84961194 1st Qu.: 23413221 AfricanAmerican:19210
## Median :152388987 Median : 45505143 Asian : 641
## Mean :165201646 Mean : 54330401 Caucasian :76099
## 3rd Qu.:230270888 3rd Qu.: 87545950 Hispanic : 2037
## Max. :443867222 Max. :189502619 Other : 1506
##
## gender age weight admission_type_id
## Female :54708 [70-80):26068 ? :98569 Min. :1.000
## Male :47055 [60-70):22483 [75-100) : 1336 1st Qu.:1.000
## Unknown/Invalid: 3 [50-60):17256 [50-75) : 897 Median :1.000
## [80-90):17197 [100-125): 625 Mean :2.024
## [40-50): 9685 [125-150): 145 3rd Qu.:3.000
## [30-40): 3775 [25-50) : 97 Max. :8.000
## (Other): 5302 (Other) : 97
## discharge_disposition_id admission_source_id time_in_hospital payer_code
## Min. : 1.000 Min. : 1.000 Min. : 1.000 ? :40256
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000 MC :32439
## Median : 1.000 Median : 7.000 Median : 4.000 HM : 6274
## Mean : 3.716 Mean : 5.754 Mean : 4.396 SP : 5007
## 3rd Qu.: 4.000 3rd Qu.: 7.000 3rd Qu.: 6.000 BC : 4655
## Max. :28.000 Max. :25.000 Max. :14.000 MD : 3532
## (Other): 9603
## medical_specialty num_lab_procedures num_procedures
## ? :49949 Min. : 1.0 Min. :0.00
## InternalMedicine :14635 1st Qu.: 31.0 1st Qu.:0.00
## Emergency/Trauma : 7565 Median : 44.0 Median :1.00
## Family/GeneralPractice: 7440 Mean : 43.1 Mean :1.34
## Cardiology : 5352 3rd Qu.: 57.0 3rd Qu.:2.00
## Surgery-General : 3099 Max. :132.0 Max. :6.00
## (Other) :13726
## num_medications number_outpatient number_emergency number_inpatient
## Min. : 1.00 Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:10.00 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :15.00 Median : 0.0000 Median : 0.0000 Median : 0.0000
## Mean :16.02 Mean : 0.3694 Mean : 0.1978 Mean : 0.6356
## 3rd Qu.:20.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 1.0000
## Max. :81.00 Max. :42.0000 Max. :76.0000 Max. :21.0000
##
## diag_1 diag_2 diag_3 number_diagnoses max_glu_serum
## 428 : 6862 276 : 6752 250 :11555 Min. : 1.000 >200: 1485
## 414 : 6581 428 : 6662 401 : 8289 1st Qu.: 6.000 >300: 1264
## 786 : 4016 250 : 6071 276 : 5175 Median : 8.000 None:96420
## 410 : 3614 427 : 5036 428 : 4577 Mean : 7.423 Norm: 2597
## 486 : 3508 401 : 3736 427 : 3955 3rd Qu.: 9.000
## 427 : 2766 496 : 3305 414 : 3664 Max. :16.000
## (Other):74419 (Other):70204 (Other):64551
## A1Cresult metformin repaglinide nateglinide chlorpropamide
## >7 : 3812 Down : 575 Down : 45 Down : 11 Down : 1
## >8 : 8216 No :81778 No :100227 No :101063 No :101680
## None:84748 Steady:18346 Steady: 1384 Steady: 668 Steady: 79
## Norm: 4990 Up : 1067 Up : 110 Up : 24 Up : 6
##
##
##
## glimepiride acetohexamide glipizide glyburide tolbutamide
## Down : 194 No :101765 Down : 560 Down : 564 No :101743
## No :96575 Steady: 1 No :89080 No :91116 Steady: 23
## Steady: 4670 Steady:11356 Steady: 9274
## Up : 327 Up : 770 Up : 812
##
##
##
## pioglitazone rosiglitazone acarbose miglitol troglitazone
## Down : 118 Down : 87 Down : 3 Down : 5 No :101763
## No :94438 No :95401 No :101458 No :101728 Steady: 3
## Steady: 6976 Steady: 6100 Steady: 295 Steady: 31
## Up : 234 Up : 178 Up : 10 Up : 2
##
##
##
## tolazamide examide citoglipton insulin glyburide.metformin
## No :101727 No:101766 No:101766 Down :12218 Down : 6
## Steady: 38 No :47383 No :101060
## Up : 1 Steady:30849 Steady: 692
## Up :11316 Up : 8
##
##
##
## glipizide.metformin glimepiride.pioglitazone metformin.rosiglitazone
## No :101753 No :101765 No :101764
## Steady: 13 Steady: 1 Steady: 2
##
##
##
##
##
## metformin.pioglitazone change diabetesMed readmitted
## No :101765 Ch:47011 No :23403 <30:11357
## Steady: 1 No:54755 Yes:78363 >30:35545
## NO :54864
##
##
##
##
summary(raw_data$gender)
## Female Male Unknown/Invalid
## 54708 47055 3
describe(raw_data)
## vars n mean sd median
## encounter_id 1 101766 165201645.62 102640295.98 152388987
## patient_nbr 2 101766 54330400.69 38696359.35 45505143
## race* 3 101766 3.60 0.94 4
## gender* 4 101766 1.46 0.50 1
## age* 5 101766 7.10 1.59 7
## weight* 6 101766 1.19 1.15 1
## admission_type_id 7 101766 2.02 1.45 1
## discharge_disposition_id 8 101766 3.72 5.28 1
## admission_source_id 9 101766 5.75 4.06 7
## time_in_hospital 10 101766 4.40 2.99 4
## payer_code* 11 101766 5.89 4.82 8
## medical_specialty* 12 101766 12.71 17.51 5
## num_lab_procedures 13 101766 43.10 19.67 44
## num_procedures 14 101766 1.34 1.71 1
## num_medications 15 101766 16.02 8.13 15
## number_outpatient 16 101766 0.37 1.27 0
## number_emergency 17 101766 0.20 0.93 0
## number_inpatient 18 101766 0.64 1.26 0
## diag_1* 19 101766 338.70 160.60 300
## diag_2* 20 101766 277.48 153.51 262
## diag_3* 21 101766 278.90 177.75 257
## number_diagnoses 22 101766 7.42 1.93 8
## max_glu_serum* 23 101766 2.98 0.31 3
## A1Cresult* 24 101766 2.89 0.52 3
## metformin* 25 101766 2.20 0.44 2
## repaglinide* 26 101766 2.02 0.13 2
## nateglinide* 27 101766 2.01 0.09 2
## chlorpropamide* 28 101766 2.00 0.03 2
## glimepiride* 29 101766 2.05 0.24 2
## acetohexamide* 30 101766 1.00 0.00 1
## glipizide* 31 101766 2.12 0.36 2
## glyburide* 32 101766 2.10 0.34 2
## tolbutamide* 33 101766 1.00 0.02 1
## pioglitazone* 34 101766 2.07 0.27 2
## rosiglitazone* 35 101766 2.06 0.25 2
## acarbose* 36 101766 2.00 0.06 2
## miglitol* 37 101766 2.00 0.02 2
## troglitazone* 38 101766 1.00 0.01 1
## tolazamide* 39 101766 1.00 0.02 1
## examide* 40 101766 1.00 0.00 1
## citoglipton* 41 101766 1.00 0.00 1
## insulin* 42 101766 2.41 0.84 2
## glyburide.metformin* 43 101766 2.01 0.08 2
## glipizide.metformin* 44 101766 1.00 0.01 1
## glimepiride.pioglitazone* 45 101766 1.00 0.00 1
## metformin.rosiglitazone* 46 101766 1.00 0.00 1
## metformin.pioglitazone* 47 101766 1.00 0.00 1
## change* 48 101766 1.54 0.50 2
## diabetesMed* 49 101766 1.77 0.42 2
## readmitted* 50 101766 2.43 0.68 3
## trimmed mad min max range
## encounter_id 156080797.87 105147686.61 12522 443867222 443854700
## patient_nbr 52476125.15 48851868.67 135 189502619 189502484
## race* 3.71 0.00 1 6 5
## gender* 1.45 0.00 1 3 2
## age* 7.21 1.48 1 10 9
## weight* 1.00 0.00 1 10 9
## admission_type_id 1.70 0.00 1 8 7
## discharge_disposition_id 2.28 0.00 1 28 27
## admission_source_id 5.33 0.00 1 25 24
## time_in_hospital 3.99 2.97 1 14 13
## payer_code* 5.33 8.90 1 18 17
## medical_specialty* 8.49 5.93 1 73 72
## num_lab_procedures 43.79 19.27 1 132 131
## num_procedures 1.02 1.48 0 6 6
## num_medications 15.23 7.41 1 81 80
## number_outpatient 0.08 0.00 0 42 42
## number_emergency 0.01 0.00 0 76 76
## number_inpatient 0.35 0.00 0 21 21
## diag_1* 332.73 146.78 1 717 716
## diag_2* 264.90 174.95 1 749 748
## diag_3* 257.05 171.98 1 790 789
## number_diagnoses 7.71 1.48 1 16 15
## max_glu_serum* 3.00 0.00 1 4 3
## A1Cresult* 2.98 0.00 1 4 3
## metformin* 2.11 0.00 1 4 3
## repaglinide* 2.00 0.00 1 4 3
## nateglinide* 2.00 0.00 1 4 3
## chlorpropamide* 2.00 0.00 1 4 3
## glimepiride* 2.00 0.00 1 4 3
## acetohexamide* 1.00 0.00 1 2 1
## glipizide* 2.02 0.00 1 4 3
## glyburide* 2.00 0.00 1 4 3
## tolbutamide* 1.00 0.00 1 2 1
## pioglitazone* 2.00 0.00 1 4 3
## rosiglitazone* 2.00 0.00 1 4 3
## acarbose* 2.00 0.00 1 4 3
## miglitol* 2.00 0.00 1 4 3
## troglitazone* 1.00 0.00 1 2 1
## tolazamide* 1.00 0.00 1 3 2
## examide* 1.00 0.00 1 1 0
## citoglipton* 1.00 0.00 1 1 0
## insulin* 2.38 1.48 1 4 3
## glyburide.metformin* 2.00 0.00 1 4 3
## glipizide.metformin* 1.00 0.00 1 2 1
## glimepiride.pioglitazone* 1.00 0.00 1 2 1
## metformin.rosiglitazone* 1.00 0.00 1 2 1
## metformin.pioglitazone* 1.00 0.00 1 2 1
## change* 1.55 0.00 1 2 1
## diabetesMed* 1.84 0.00 1 2 1
## readmitted* 2.53 0.00 1 3 2
## skew kurtosis se
## encounter_id 0.70 -0.10 321748.51
## patient_nbr 0.47 -0.35 121302.22
## race* -1.04 0.66 0.00
## gender* 0.15 -1.98 0.00
## age* -0.63 0.28 0.00
## weight* 6.17 36.88 0.00
## admission_type_id 1.59 1.94 0.00
## discharge_disposition_id 2.56 6.00 0.02
## admission_source_id 1.03 1.74 0.01
## time_in_hospital 1.13 0.85 0.01
## payer_code* 0.50 -0.72 0.02
## medical_specialty* 1.89 2.95 0.05
## num_lab_procedures -0.24 -0.25 0.06
## num_procedures 1.32 0.86 0.01
## num_medications 1.33 3.47 0.03
## number_outpatient 8.83 147.90 0.00
## number_emergency 22.85 1191.60 0.00
## number_inpatient 3.61 20.72 0.00
## diag_1* 0.38 -0.34 0.50
## diag_2* 0.72 0.41 0.48
## diag_3* 1.01 0.69 0.56
## number_diagnoses -0.88 -0.08 0.01
## max_glu_serum* -3.33 25.71 0.00
## A1Cresult* -1.76 5.43 0.00
## metformin* 1.69 2.54 0.00
## repaglinide* 8.59 88.34 0.00
## nateglinide* 12.43 175.40 0.00
## chlorpropamide* 37.86 1651.27 0.00
## glimepiride* 4.34 22.49 0.00
## acetohexamide* 319.00 101759.00 0.00
## glipizide* 2.41 6.64 0.00
## glyburide* 2.76 9.24 0.00
## tolbutamide* 66.49 4419.52 0.00
## pioglitazone* 3.47 12.49 0.00
## rosiglitazone* 3.77 14.65 0.00
## acarbose* 19.02 403.22 0.00
## miglitol* 45.88 3570.19 0.00
## troglitazone* 184.17 33916.33 0.00
## tolazamide* 53.88 3110.38 0.00
## examide* NaN NaN 0.00
## citoglipton* NaN NaN 0.00
## insulin* 0.25 -0.50 0.00
## glyburide.metformin* 12.01 152.88 0.00
## glipizide.metformin* 88.46 7823.00 0.00
## glimepiride.pioglitazone* 319.00 101759.00 0.00
## metformin.rosiglitazone* 225.56 50877.00 0.00
## metformin.pioglitazone* 319.00 101759.00 0.00
## change* -0.15 -1.98 0.00
## diabetesMed* -1.28 -0.35 0.00
## readmitted* -0.78 -0.57 0.00
Next I drop variables that are unnecessary or uninformative, or that have too many missing values. I dropped “encounter_id”, “patient_nbr”, “weight”, “payer_code”, “medical_specialty”, “diabetesMed”, “diag_1”, “diag_2” and “diag_3”. Several of these are mostly missing: for example, “weight” is “?” in 98569 of the 101766 records and “medical_specialty” in 49949. In addition, any encounter whose discharge status is “expired” (discharge_disposition_id 11, 19, 20 or 21) is dropped. I also removed the records with unknown gender because I cannot impute the missing values in the “gender” column. Finally, I drop columns whose values barely change across records (near-zero variance), using nearZeroVar() from caret.
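drops <- c("encounter_id", "patient_nbr", "weight", "payer_code",
           "medical_specialty", "diabetesMed", "diag_2", "diag_3", "diag_1")
raw_data <- raw_data[ , !(names(raw_data) %in% drops)]
# Remove encounters with an "expired" discharge status (codes 11, 19, 20, 21)
raw_data <- filter(raw_data, !(discharge_disposition_id %in% c(11, 19, 20, 21)))
# Gender: drop the few "Unknown/Invalid" records
raw_data <- raw_data %>% filter(gender != "Unknown/Invalid")
raw_data$gender <- droplevels(raw_data$gender)
# Race: missing values are coded as "?" in this file, not as "NA"
raw_data <- raw_data %>% filter(race != "?")
raw_data$race <- droplevels(raw_data$race)
# Drop near-zero-variance columns. The guard matters because indexing with
# -integer(0) would drop every column if nothing were flagged.
avoid_features <- nearZeroVar(raw_data)
if (length(avoid_features) > 0) raw_data <- raw_data[, -avoid_features]
kable(head(raw_data))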
race | gender | age | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | number_diagnoses | A1Cresult | metformin | glipizide | glyburide | pioglitazone | rosiglitazone | insulin | change | readmitted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Caucasian | Female | [0-10) | 6 | 25 | 1 | 1 | 41 | 0 | 1 | 0 | 0 | 0 | 1 | None | No | No | No | No | No | No | No | NO |
Caucasian | Female | [10-20) | 1 | 1 | 7 | 3 | 59 | 0 | 18 | 0 | 0 | 0 | 9 | None | No | No | No | No | No | Up | Ch | >30 |
AfricanAmerican | Female | [20-30) | 1 | 1 | 7 | 2 | 11 | 5 | 13 | 2 | 0 | 1 | 6 | None | No | Steady | No | No | No | No | No | NO |
Caucasian | Male | [30-40) | 1 | 1 | 7 | 2 | 44 | 1 | 16 | 0 | 0 | 0 | 7 | None | No | No | No | No | No | Up | Ch | NO |
Caucasian | Male | [40-50) | 1 | 1 | 7 | 1 | 51 | 0 | 8 | 0 | 0 | 0 | 5 | None | No | Steady | No | No | No | Steady | Ch | NO |
Caucasian | Male | [50-60) | 2 | 1 | 2 | 3 | 31 | 6 | 16 | 0 | 0 | 0 | 9 | None | No | No | No | No | No | Steady | No | >30 |
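To make the near-zero-variance filter less of a black box, here is a minimal base-R sketch of the rule as I understand it from caret's documentation: a column is flagged when the most common value is far more frequent than the second most common one and the column has few distinct values relative to its length. The function is_near_zero_var() is a hypothetical helper and not caret's implementation. The thresholds mirror caret's documented defaults (freqCut = 95/5, uniqueCut = 10), and the toy columns are rebuilt from the counts in the summary(raw_data) output above.

```r
# Hypothetical re-implementation of the near-zero-variance rule (not caret's code).
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- table(x)
  counts <- sort(counts[counts > 0], decreasing = TRUE)  # ignore unused factor levels
  if (length(counts) < 2) return(TRUE)                   # a single value: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]                # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(x)         # distinct values as % of rows
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

# Columns rebuilt from the level counts printed by summary(raw_data).
tolbutamide <- rep(c("No", "Steady"), times = c(101743, 23))
change      <- rep(c("Ch", "No"),     times = c(47011, 54755))

nzv_tolbutamide <- is_near_zero_var(tolbutamide)  # almost always "No"
nzv_change      <- is_near_zero_var(change)       # roughly balanced
print(c(tolbutamide = nzv_tolbutamide, change = nzv_change))
```

Under these assumptions tolbutamide is flagged and change is not, which is consistent with the columns that remain in the table above.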
For the encoding, I used the following scheme:
Age : replace each 10-year interval by its midpoint. For example [0-10) = 5, [10-20) = 15, etc.
Medication Change : no change = 0, change = 1
Gender : Female = 0, Male = 1
Race : Caucasian = 0, African American = 1, Other = 2
Insulin dosage : no insulin = 0; decrease in insulin = -1; steady insulin = 1; increase in insulin = 2
rosiglitazone : No == 0; any other level (Down, Steady, Up) == 1
pioglitazone : No == 0; any other level (Down, Steady, Up) == 1
glyburide : No == 0; any other level (Down, Steady, Up) == 1
glipizide : No == 0; any other level (Down, Steady, Up) == 1
metformin : No == 0; any other level (Down, Steady, Up) == 1
A1C results : None == 0, Normal == 1, abnormal (>7 or >8) == 2
Target (readmitted) : no readmission within 30 days (NO or >30) == 0; readmission in <30 days == 1
# Age
# Each 10-year age interval is replaced by its midpoint.
raw_data$age <- ifelse(raw_data$age == "[0-10)", 5,
ifelse(raw_data$age == "[10-20)", 15,
ifelse(raw_data$age == "[20-30)", 25,
ifelse(raw_data$age == "[30-40)", 35,
ifelse(raw_data$age == "[40-50)", 45,
ifelse(raw_data$age == "[50-60)", 55,
ifelse(raw_data$age == "[60-70)", 65,
ifelse(raw_data$age == "[70-80)", 75,
ifelse(raw_data$age == "[80-90)", 85, 95)))))))))
# Medication change
# no change == 0; change == 1
raw_data$change <- ifelse(raw_data$change == "Ch", 1, 0)
# Gender
# Female == 0; Male == 1
raw_data$gender <- ifelse(raw_data$gender == "Female", 0, 1)
# Race
# Caucasian == 0; African American == 1; Other (all remaining levels) == 2
raw_data$race <- ifelse(raw_data$race == "Caucasian", 0, ifelse(raw_data$race == "AfricanAmerican", 1, 2))
# Insulin dosage
# no insulin = 0; decrease in insulin = -1; steady insulin = 1; increase in insulin = 2
raw_data$insulin <- ifelse(raw_data$insulin == "No", 0, ifelse(raw_data$insulin == "Down", -1, ifelse(raw_data$insulin == "Steady", 1, 2)))
# rosiglitazone, pioglitazone, glyburide, glipizide, metformin
# No == 0; any other level (Down, Steady, Up) == 1
for (med in c("rosiglitazone", "pioglitazone", "glyburide", "glipizide", "metformin")) {
  raw_data[[med]] <- ifelse(raw_data[[med]] == "No", 0, 1)
}
# A1C results
# None == 0; Normal == 1; abnormal (>7 or >8) == 2
raw_data$A1Cresult <- ifelse(raw_data$A1Cresult == "None", 0,
                      ifelse(raw_data$A1Cresult == "Norm", 1,
                      ifelse(raw_data$A1Cresult %in% c(">7", ">8"), 2, NA)))
# Target
# "readmitted": no readmission within 30 days (NO or >30) == 0; readmission in <30 days == 1
raw_data$readmitted <- ifelse((raw_data$readmitted == "NO" | raw_data$readmitted == ">30"), 0, 1)
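The nested ifelse() used for age above can also be written by parsing the interval labels directly, which avoids typing each bracket by hand. This is only a sketch of an alternative: interval_midpoint() is a hypothetical helper, and it assumes every label has the form "[lo-hi)" with two integers.

```r
# Hypothetical helper: midpoint of an interval label such as "[70-80)".
interval_midpoint <- function(label) {
  label <- as.character(label)                          # also accepts factors
  bounds <- regmatches(label, gregexpr("[0-9]+", label)) # extract the two bounds
  vapply(bounds, function(b) mean(as.numeric(b)), numeric(1))
}

age_labels <- factor(c("[0-10)", "[70-80)", "[90-100)"))
age_mid <- interval_midpoint(age_labels)
print(age_mid)
```

Applied to the original age factor, this would give the same midpoints (5, 15, ..., 95) as the explicit mapping.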
The distribution of the target variable, readmitted, is shown in the plot below. It is a binary variable with no missing values, and the two classes are heavily imbalanced: in the raw data only 11357 of the 101766 encounters (about 11%) are readmitted within 30 days (1), while all others are coded as not readmitted (0).
raw_data %>%
  ggplot(aes(readmitted)) +
  geom_histogram(bins = 30) +
  theme_bw() +
  labs(y = 'Count', title = 'Readmitted Histogram')
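The class balance can also be checked numerically. The sketch below uses the level counts printed by summary(raw_data) earlier, so it describes the raw data before the expired and unknown-gender records were removed. The shares after filtering will differ slightly. The variable names are my own.

```r
# Level counts of readmitted, copied from the summary(raw_data) output.
readmitted_counts <- c("<30" = 11357, ">30" = 35545, "NO" = 54864)

# Collapse to the binary target used in this analysis.
binary_counts <- c(
  not_readmitted_30d = sum(readmitted_counts[c(">30", "NO")]),
  readmitted_30d     = readmitted_counts[["<30"]]
)

class_share <- binary_counts / sum(binary_counts)
print(round(class_share, 3))
```

With roughly one positive case in nine, plain accuracy can look good for a model that never predicts a readmission, which is worth keeping in mind when the models are compared.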
raw_data[,-c(1)] %>%
gather(Variable, Values) %>%
ggplot(aes(x = Values)) +
geom_histogram(alpha = 0.2, col = "black", bins = 15) +
facet_wrap(~ Variable, scales = "free", nrow = 6)
The histograms of the individual predictors show that, besides the continuous numerical variables, there are many discrete variables, such as number_inpatient and number_emergency.
The obvious discrete variables are listed below. Age, admission_type_id and time_in_hospital take no more than 10-14 distinct values, while num_medications and num_lab_procedures are counts with wider ranges (1 to 81 and 1 to 132).
Age,
admission_type_id,
time_in_hospital,
num_medications,
num_lab_procedures
There is one bimodal variable:
number_diagnoses
The histograms also show the left skewness of Age (skew -0.63 in the describe() output), whose counts peak in the 70s. time_in_hospital is right-skewed (skew 1.13): most stays are short (median 4 days), with a tail out to 14 days.
We chose bins = 15 and facet_wrap for the histograms. These findings are preserved after changing the number of bins.
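The direction of skew can be confirmed with a moment-based skewness coefficient: positive values indicate a long right tail, negative values a long left tail. The function below is a simple sketch with a hypothetical name. psych::describe() uses a closely related estimator, so its values may differ slightly for small samples. The two toy vectors are illustrative only.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

right_tailed <- c(1, 1, 2, 2, 3, 4, 14)  # a few long stays pull the tail to the right
left_tailed  <- -right_tailed            # mirror image: tail to the left

skew_right <- sample_skewness(right_tailed)
skew_left  <- sample_skewness(left_tailed)
print(c(right = skew_right, left = skew_left))
```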
raw_data[,-c(1)] %>%
gather(Variable, Values) %>%
ggplot(aes( y = Values)) +
geom_boxplot() +
facet_wrap(~ Variable, scales = "free", nrow = 4) +
theme(panel.background = element_rect(fill = 'white'),
axis.text.x = element_text(size = 10, angle = 90))
Because some of the variables are skewed, the box plots flag many of their values as outliers. These variables include:
A1Cresult,
number_emergency,
num_medications,
time_in_hospital,
num_lab_procedures
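Box plots conventionally mark points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule directly, assuming the usual 1.5 multiplier. iqr_outliers() is a hypothetical helper, and the toy vector only imitates a zero-heavy count such as number_emergency. It shows why such variables produce so many flagged points: when most values are 0, the interquartile range collapses and every non-zero value falls outside the fences.

```r
# Hypothetical helper: values outside the 1.5 * IQR fences.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x[!is.na(x) & (x < q[1] - fence | x > q[2] + fence)]
}

# Toy zero-inflated counts (illustrative values, not taken from the data set).
visits <- c(rep(0, 95), 1, 1, 2, 3, 76)
flagged <- iqr_outliers(visits)
print(flagged)
```

For count variables like these, the flagged points are therefore a consequence of the distribution's shape and not necessarily data errors.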