Diabetes is a chronic medical condition that occurs when the pancreas does not produce sufficient insulin, a hormone required to regulate blood sugar, or when the body cannot effectively use the insulin it produces (World Health Organization, 2024). The World Health Organization reports that the number of people living with diabetes rose from 200 million in 1990 to 830 million in 2022 (World Health Organization, 2024). Diabetes can also lead to serious medical complications or death (World Health Organization, 2024). Complications and disease progression can be mitigated through lifestyle changes, including but not limited to changes in diet and exercise (World Health Organization, 2024). Early prediction of the development of diabetes and its complications could therefore benefit millions of individuals.
Sampling methods are beneficial in the analysis of large, complex, and high-dimensional datasets because analyzing a representative sample of observations is more computationally efficient than analyzing every observation. Traditional sampling methods include random, stratified, and systematic sampling. The purpose of this project is to determine the impact of random, stratified, and systematic sampling distributions on data analysis of the UCI 130 Hospitals in the US from 1999 to 2008 dataset (hereafter referred to as the 130 Hospitals Diabetes dataset).
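To make the three sampling schemes concrete, the following is a minimal base R sketch on simulated data. It is not the code used in this project: the `toy` data frame, the sample size `n`, and the choice of `readmitted` as the stratification variable are assumptions made only for illustration.

```r
# Simulated stand-in for the hospital data (values are NOT real).
set.seed(123)
toy <- data.frame(
  encounter_id     = 1:1000,
  readmitted       = sample(c("<30", ">30", "NO"), 1000, replace = TRUE,
                            prob = c(0.11, 0.35, 0.54)),
  time_in_hospital = sample(1:14, 1000, replace = TRUE)
)
n <- 100  # hypothetical sample size

# Simple random sampling: every row has the same chance of selection.
random_sample <- toy[sample(nrow(toy), n), ]

# Stratified sampling: draw the same fraction from each readmission class
# so that class proportions in the sample match the full data.
frac   <- n / nrow(toy)
strata <- split(seq_len(nrow(toy)), toy$readmitted)
strat_idx <- unlist(lapply(strata, function(idx) {
  idx[sample.int(length(idx), round(length(idx) * frac))]
}))
stratified_sample <- toy[strat_idx, ]

# Systematic sampling: random starting row, then every k-th row.
k     <- nrow(toy) %/% n
start <- sample(k, 1)
systematic_sample <- toy[seq(start, by = k, length.out = n), ]

# Compare class proportions in the full data and in the stratified sample.
round(prop.table(table(toy$readmitted)), 2)
round(prop.table(table(stratified_sample$readmitted)), 2)
```

Because each stratum is rounded separately, the stratified sample can differ from `n` by a row or two. Systematic sampling assumes that row order is unrelated to the outcome; if the file is sorted in a meaningful way, the rows could be shuffled first.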
#Set working directory
getwd()
## [1] "C:/Users/benke/OneDrive/NU/8521/Assignment 3"
setwd("C:/Users/benke/Downloads")
The 130 Hospitals Diabetes dataset (Clore et al., 2014) was downloaded from the UCI Machine Learning Repository for analysis. RStudio was the integrated development environment (IDE) used for the analysis. An R Markdown document was created and knitted to an HTML file for review.
#Load dataset
Diabetes.df <- read.csv("diabetic_data.csv")
#Load necessary libraries
#plyr is loaded before dplyr so that it does not mask dplyr functions such as mutate() and summarise()
library(plyr)
library(dplyr)
library(Amelia)
library(ggplot2)
library(scales)
library(moments)
library(readr)
library(kableExtra)
library(tidyr)
library(rsample)
library(bestNormalize)
library(stats)
library(yardstick)
library(skimr)
library(stringr)
library(caret)
#Install datadictionary only if it is not already available
if (!requireNamespace("datadictionary", quietly = TRUE)) {
  install.packages("datadictionary", repos = "https://cloud.r-project.org")
}
library(datadictionary)
#Completion of a preliminary inspection of the data
skim(Diabetes.df)
| | |
|---|---|
| Name | Diabetes.df |
| Number of rows | 101766 |
| Number of columns | 50 |
| Column type frequency: | |
| character | 37 |
| numeric | 13 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| race | 0 | 1 | 1 | 15 | 0 | 6 | 0 |
| gender | 0 | 1 | 4 | 15 | 0 | 3 | 0 |
| age | 0 | 1 | 6 | 8 | 0 | 10 | 0 |
| weight | 0 | 1 | 1 | 9 | 0 | 10 | 0 |
| payer_code | 0 | 1 | 1 | 2 | 0 | 18 | 0 |
| medical_specialty | 0 | 1 | 1 | 36 | 0 | 73 | 0 |
| diag_1 | 0 | 1 | 1 | 6 | 0 | 717 | 0 |
| diag_2 | 0 | 1 | 1 | 6 | 0 | 749 | 0 |
| diag_3 | 0 | 1 | 1 | 6 | 0 | 790 | 0 |
| max_glu_serum | 0 | 1 | 4 | 4 | 0 | 4 | 0 |
| A1Cresult | 0 | 1 | 2 | 4 | 0 | 4 | 0 |
| metformin | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| repaglinide | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| nateglinide | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| chlorpropamide | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| glimepiride | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| acetohexamide | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
| glipizide | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| glyburide | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| tolbutamide | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
| pioglitazone | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| rosiglitazone | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| acarbose | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| miglitol | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| troglitazone | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
| tolazamide | 0 | 1 | 2 | 6 | 0 | 3 | 0 |
| examide | 0 | 1 | 2 | 2 | 0 | 1 | 0 |
| citoglipton | 0 | 1 | 2 | 2 | 0 | 1 | 0 |
| insulin | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| glyburide.metformin | 0 | 1 | 2 | 6 | 0 | 4 | 0 |
| glipizide.metformin | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
| glimepiride.pioglitazone | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
| metformin.rosiglitazone | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
| metformin.pioglitazone | 0 | 1 | 2 | 6 | 0 | 2 | 0 |
| change | 0 | 1 | 2 | 2 | 0 | 2 | 0 |
| diabetesMed | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| readmitted | 0 | 1 | 2 | 3 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| encounter_id | 0 | 1 | 165201645.62 | 102640295.98 | 12522 | 84961194 | 152388987 | 230270888 | 443867222 | ▆▇▅▂▂ |
| patient_nbr | 0 | 1 | 54330400.69 | 38696359.35 | 135 | 23413221 | 45505143 | 87545950 | 189502619 | ▇▆▆▁▁ |
| admission_type_id | 0 | 1 | 2.02 | 1.45 | 1 | 1 | 1 | 3 | 8 | ▇▂▁▁▁ |
| discharge_disposition_id | 0 | 1 | 3.72 | 5.28 | 1 | 1 | 1 | 4 | 28 | ▇▁▁▁▁ |
| admission_source_id | 0 | 1 | 5.75 | 4.06 | 1 | 1 | 7 | 7 | 25 | ▅▇▁▁▁ |
| time_in_hospital | 0 | 1 | 4.40 | 2.99 | 1 | 2 | 4 | 6 | 14 | ▇▅▂▁▁ |
| num_lab_procedures | 0 | 1 | 43.10 | 19.67 | 1 | 31 | 44 | 57 | 132 | ▃▇▅▁▁ |
| num_procedures | 0 | 1 | 1.34 | 1.71 | 0 | 0 | 1 | 2 | 6 | ▇▂▁▁▁ |
| num_medications | 0 | 1 | 16.02 | 8.13 | 1 | 10 | 15 | 20 | 81 | ▇▃▁▁▁ |
| number_outpatient | 0 | 1 | 0.37 | 1.27 | 0 | 0 | 0 | 0 | 42 | ▇▁▁▁▁ |
| number_emergency | 0 | 1 | 0.20 | 0.93 | 0 | 0 | 0 | 0 | 76 | ▇▁▁▁▁ |
| number_inpatient | 0 | 1 | 0.64 | 1.26 | 0 | 0 | 0 | 1 | 21 | ▇▁▁▁▁ |
| number_diagnoses | 0 | 1 | 7.42 | 1.93 | 1 | 6 | 8 | 9 | 16 | ▁▅▇▁▁ |
The dataset consists of 101,766 observations and 50 variables. Although continuous numeric variables were expected to be easy to identify among the 50 variables, inspection showed otherwise. Apart from the identifier and coded ID columns, only variables recording the number of medical occurrences, such as lab procedures or medications, held numeric (integer) values. The remaining variables were categorical.
Six variables relate to identification and demographics: the encounter and patient identifiers, race, gender, age, and weight. Fifteen variables describe the medical course, including but not limited to referral source, payer codes, and diagnosis codes. Additionally, several variables relate to the number of medical occurrences, including lab procedures, medications, outpatient visits, emergency room visits, inpatient admissions, and diagnoses. Two variables relate to lab results (glucose serum and A1C result). Twenty-three variables relate to the medication course, each with up to four levels (Down, No, Steady, and Up). The last three variables indicate whether diabetes medication was changed, whether diabetes medication was prescribed, and whether the patient was readmitted. A full data dictionary is available in the R Markdown HTML file.
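The split between numeric and categorical columns described above can be checked programmatically. The sketch below uses a three-row toy data frame whose column names are borrowed from the dataset; the values and the `toy` object are illustrative assumptions, not the project's data.

```r
# Toy data frame with one integer column and two character columns.
toy <- data.frame(
  num_medications = c(1L, 18L, 13L),
  race            = c("Caucasian", "?", "AfricanAmerican"),
  insulin         = c("No", "Up", "Steady"),
  stringsAsFactors = FALSE
)

# Count how many columns fall into each storage class.
type_counts <- table(vapply(toy, function(col) class(col)[1], character(1)))
print(type_counts)

# Names of the columns that hold numeric values.
numeric_cols <- names(toy)[vapply(toy, is.numeric, logical(1))]
print(numeric_cols)
```

Applied to the full data frame, the same two calls would list which of the 50 columns R stores as integers and which as character strings. Coded ID columns still appear as numeric here and have to be identified as categorical by inspection.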
#Further inspection of the data
glimpse(Diabetes.df)
## Rows: 101,766
## Columns: 50
## $ encounter_id <int> 2278392, 149190, 64410, 500364, 16680, 35754,…
## $ patient_nbr <int> 8222157, 55629189, 86047875, 82442376, 425192…
## $ race <chr> "Caucasian", "Caucasian", "AfricanAmerican", …
## $ gender <chr> "Female", "Female", "Female", "Male", "Male",…
## $ age <chr> "[0-10)", "[10-20)", "[20-30)", "[30-40)", "[…
## $ weight <chr> "?", "?", "?", "?", "?", "?", "?", "?", "?", …
## $ admission_type_id <int> 6, 1, 1, 1, 1, 2, 3, 1, 2, 3, 1, 2, 1, 1, 3, …
## $ discharge_disposition_id <int> 25, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 6, 1,…
## $ admission_source_id <int> 1, 7, 7, 7, 7, 2, 2, 7, 4, 4, 7, 4, 7, 7, 2, …
## $ time_in_hospital <int> 1, 3, 2, 2, 1, 3, 4, 5, 13, 12, 9, 7, 7, 10, …
## $ payer_code <chr> "?", "?", "?", "?", "?", "?", "?", "?", "?", …
## $ medical_specialty <chr> "Pediatrics-Endocrinology", "?", "?", "?", "?…
## $ num_lab_procedures <int> 41, 59, 11, 44, 51, 31, 70, 73, 68, 33, 47, 6…
## $ num_procedures <int> 0, 0, 5, 1, 0, 6, 1, 0, 2, 3, 2, 0, 0, 1, 5, …
## $ num_medications <int> 1, 18, 13, 16, 8, 16, 21, 12, 28, 18, 17, 11,…
## $ number_outpatient <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ number_emergency <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ number_inpatient <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ diag_1 <chr> "250.83", "276", "648", "8", "197", "414", "4…
## $ diag_2 <chr> "?", "250.01", "250", "250.43", "157", "411",…
## $ diag_3 <chr> "?", "255", "V27", "403", "250", "250", "V45"…
## $ number_diagnoses <int> 1, 9, 6, 7, 5, 9, 7, 8, 8, 8, 9, 7, 8, 8, 8, …
## $ max_glu_serum <chr> "None", "None", "None", "None", "None", "None…
## $ A1Cresult <chr> "None", "None", "None", "None", "None", "None…
## $ metformin <chr> "No", "No", "No", "No", "No", "No", "Steady",…
## $ repaglinide <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ nateglinide <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ chlorpropamide <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ glimepiride <chr> "No", "No", "No", "No", "No", "No", "Steady",…
## $ acetohexamide <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ glipizide <chr> "No", "No", "Steady", "No", "Steady", "No", "…
## $ glyburide <chr> "No", "No", "No", "No", "No", "No", "No", "St…
## $ tolbutamide <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ pioglitazone <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ rosiglitazone <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ acarbose <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ miglitol <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ troglitazone <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ tolazamide <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ examide <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ citoglipton <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ insulin <chr> "No", "Up", "No", "Up", "Steady", "Steady", "…
## $ glyburide.metformin <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ glipizide.metformin <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ glimepiride.pioglitazone <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ metformin.rosiglitazone <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ metformin.pioglitazone <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ change <chr> "No", "Ch", "No", "Ch", "Ch", "No", "Ch", "No…
## $ diabetesMed <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes…
## $ readmitted <chr> "NO", ">30", "NO", "NO", "NO", ">30", "NO", "…
View(Diabetes.df)
#Further data inspection
head(Diabetes.df)
## encounter_id patient_nbr race gender age weight
## 1 2278392 8222157 Caucasian Female [0-10) ?
## 2 149190 55629189 Caucasian Female [10-20) ?
## 3 64410 86047875 AfricanAmerican Female [20-30) ?
## 4 500364 82442376 Caucasian Male [30-40) ?
## 5 16680 42519267 Caucasian Male [40-50) ?
## 6 35754 82637451 Caucasian Male [50-60) ?
## admission_type_id discharge_disposition_id admission_source_id
## 1 6 25 1
## 2 1 1 7
## 3 1 1 7
## 4 1 1 7
## 5 1 1 7
## 6 2 1 2
## time_in_hospital payer_code medical_specialty num_lab_procedures
## 1 1 ? Pediatrics-Endocrinology 41
## 2 3 ? ? 59
## 3 2 ? ? 11
## 4 2 ? ? 44
## 5 1 ? ? 51
## 6 3 ? ? 31
## num_procedures num_medications number_outpatient number_emergency
## 1 0 1 0 0
## 2 0 18 0 0
## 3 5 13 2 0
## 4 1 16 0 0
## 5 0 8 0 0
## 6 6 16 0 0
## number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum
## 1 0 250.83 ? ? 1 None
## 2 0 276 250.01 255 9 None
## 3 1 648 250 V27 6 None
## 4 0 8 250.43 403 7 None
## 5 0 197 157 250 5 None
## 6 0 414 411 250 9 None
## A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride
## 1 None No No No No No
## 2 None No No No No No
## 3 None No No No No No
## 4 None No No No No No
## 5 None No No No No No
## 6 None No No No No No
## acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone
## 1 No No No No No No
## 2 No No No No No No
## 3 No Steady No No No No
## 4 No No No No No No
## 5 No Steady No No No No
## 6 No No No No No No
## acarbose miglitol troglitazone tolazamide examide citoglipton insulin
## 1 No No No No No No No
## 2 No No No No No No Up
## 3 No No No No No No No
## 4 No No No No No No Up
## 5 No No No No No No Steady
## 6 No No No No No No Steady
## glyburide.metformin glipizide.metformin glimepiride.pioglitazone
## 1 No No No
## 2 No No No
## 3 No No No
## 4 No No No
## 5 No No No
## 6 No No No
## metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
## 1 No No No No NO
## 2 No No Ch Yes >30
## 3 No No No Yes NO
## 4 No No Ch Yes NO
## 5 No No Ch Yes NO
## 6 No No No Yes >30
#Reclassification of numeric categorical variables to factors
Diabetes.df$admission_type_id <- as.factor(Diabetes.df$admission_type_id)
Diabetes.df$discharge_disposition_id <- as.factor(Diabetes.df$discharge_disposition_id)
Diabetes.df$admission_source_id <- as.factor(Diabetes.df$admission_source_id)
Upon inspection, several values were identified as missing, encoded with a “?” or “Unknown/Invalid” entry. Missing values were handled with mean imputation for numerical variables and mode imputation for categorical variables. Categorical variables were also reclassified as factors.
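The recoding and imputation described above could be sketched in base R as follows. This is an illustrative outline on a toy data frame, not the code used for this report; `placeholders`, `mode_value()`, and `impute_column()` are hypothetical names introduced only for this example.

```r
# Toy data with the two placeholder encodings and one numeric gap.
toy <- data.frame(
  race            = c("Caucasian", "?", "AfricanAmerican", "Caucasian"),
  gender          = c("Female", "Male", "Unknown/Invalid", "Female"),
  num_medications = c(10, NA, 14, 18),
  stringsAsFactors = FALSE
)

# Step 1: recode placeholder strings in character columns as NA.
placeholders <- c("?", "Unknown/Invalid")
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], function(col) {
  col[col %in% placeholders] <- NA
  col
})

# Hypothetical helper: most frequent non-missing value of a vector.
mode_value <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Step 2: mean imputation for numeric columns, mode imputation otherwise.
impute_column <- function(col) {
  if (is.numeric(col)) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
  } else {
    col[is.na(col)] <- mode_value(col)
  }
  col
}
toy[] <- lapply(toy, impute_column)
print(toy)
```

Recoding the placeholders before converting columns to factors avoids carrying “?” forward as a factor level. Alternatively, `read.csv()` accepts an `na.strings` argument that can mark such entries as missing at load time.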
#Convert additional categorical columns to factors (race, gender, and columns 23 to 50; column 7 was already converted above)
factor_columns_diabetes <- c(3, 4, 7, 23:50)
Diabetes.df[factor_columns_diabetes] <- lapply(Diabetes.df[factor_columns_diabetes], function(col) as.factor(as.character(col)))
#Creation of a data dictionary
dict.Diabetes.df <- create_dictionary(Diabetes.df)
## Warning in factor_summary(dataset, column): discharge_disposition_id has more
## than 10 levels, did you want a character variable?
## Warning in factor_summary(dataset, column): admission_source_id has more than
## 10 levels, did you want a character variable?
#One label per column, listed in the same order as the columns of Diabetes.df
Diabetes.labels <- c(
  encounter_id = "Patient encounter ID",
  patient_nbr = "Patient ID number",
  race = "Race",
  gender = "Gender",
  age = "Age",
  weight = "Weight",
  admission_type_id = "Admission Type",
  discharge_disposition_id = "Disposition at Discharge",
  admission_source_id = "Source of Admission",
  time_in_hospital = "Time in Hospital",
  payer_code = "Payer Code",
  medical_specialty = "Medical Specialty",
  num_lab_procedures = "Number of Lab Procedures",
  num_procedures = "Number of Procedures",
  num_medications = "Number of Medications",
  number_outpatient = "Number of Outpatient Visits",
  number_emergency = "Number of Emergency Room Visits",
  number_inpatient = "Number of Inpatient Admissions",
  diag_1 = "Diagnosis 1",
  diag_2 = "Diagnosis 2",
  diag_3 = "Diagnosis 3",
  number_diagnoses = "Number of Diagnoses",
  max_glu_serum = "Maximum Glucose",
  A1Cresult = "Result of A1C",
  metformin = "Medication yes/no: Metformin",
  repaglinide = "Medication yes/no: Repaglinide",
  nateglinide = "Medication yes/no: Nateglinide",
  chlorpropamide = "Medication yes/no: Chlorpropamide",
  glimepiride = "Medication yes/no: Glimepiride",
  acetohexamide = "Medication yes/no: Acetohexamide",
  glipizide = "Medication yes/no: Glipizide",
  glyburide = "Medication yes/no: Glyburide",
  tolbutamide = "Medication yes/no: Tolbutamide",
  pioglitazone = "Medication yes/no: Pioglitazone",
  rosiglitazone = "Medication yes/no: Rosiglitazone",
  acarbose = "Medication yes/no: Acarbose",
  miglitol = "Medication yes/no: Miglitol",
  troglitazone = "Medication yes/no: Troglitazone",
  tolazamide = "Medication yes/no: Tolazamide",
  examide = "Medication yes/no: Examide",
  citoglipton = "Medication yes/no: Citoglipton",
  insulin = "Insulin Use",
  glyburide.metformin = "Medication Combination: Glyburide-Metformin yes/no",
  glipizide.metformin = "Medication Combination: Glipizide-Metformin yes/no",
  glimepiride.pioglitazone = "Medication Combination: Glimepiride-Pioglitazone yes/no",
  metformin.rosiglitazone = "Medication Combination: Metformin-Rosiglitazone yes/no",
  metformin.pioglitazone = "Medication Combination: Metformin-Pioglitazone yes/no",
  change = "Change in Diabetes Medication",
  diabetesMed = "Medicated for Diabetes",
  readmitted = "Readmitted")
create_dictionary(Diabetes.df, var_labels = Diabetes.labels)
kable(dict.Diabetes.df, caption = "Data Dictionary")
| item | label | class | summary | value |
|---|---|---|---|---|
| | | | Rows in dataset | 101766 |
| | | | Columns in dataset | 50 |
| encounter_id | No label | integer | mean | 165201646 |
| | | | median | 152388987 |
| | | | min | 12522 |
| | | | max | 443867222 |
| | | | missing | 0 |
| patient_nbr | No label | integer | mean | 54330401 |
| | | | median | 45505143 |
| | | | min | 135 |
| | | | max | 189502619 |
| | | | missing | 0 |
| race | No label | factor | ? (1) | 2273 |
| | | | AfricanAmerican (2) | 19210 |
| | | | Asian (3) | 641 |
| | | | Caucasian (4) | 76099 |
| | | | Hispanic (5) | 2037 |
| | | | Other (6) | 1506 |
| | | | missing | 0 |
| gender | No label | factor | Female (1) | 54708 |
| | | | Male (2) | 47055 |
| | | | Unknown/Invalid (3) | 3 |
| | | | missing | 0 |
| age | No label | character | unique responses | 10 |
| | | | missing | 0 |
| weight | No label | character | unique responses | 10 |
| | | | missing | 0 |
| admission_type_id | No label | factor | 1 (1) | 53990 |
| | | | 2 (2) | 18480 |
| | | | 3 (3) | 18869 |
| | | | 4 (4) | 10 |
| | | | 5 (5) | 4785 |
| | | | 6 (6) | 5291 |
| | | | 7 (7) | 21 |
| | | | 8 (8) | 320 |
| | | | missing | 0 |
| discharge_disposition_id | No label | factor | 1 (1) | 60234 |
| | | | 2 (2) | 2128 |
| | | | 3 (3) | 13954 |
| | | | 4 (4) | 815 |
| | | | 5 (5) | 1184 |
| | | | 6 (6) | 12902 |
| | | | 7 (7) | 623 |
| | | | 8 (8) | 108 |
| | | | 9 (9) | 21 |
| | | | 10 (10) | 6 |
| | | | 11 (11) | 1642 |
| | | | 12 (12) | 3 |
| | | | 13 (13) | 399 |
| | | | 14 (14) | 372 |
| | | | 15 (15) | 63 |
| | | | 16 (16) | 11 |
| | | | 17 (17) | 14 |
| | | | 18 (18) | 3691 |
| | | | 19 (19) | 8 |
| | | | 20 (20) | 2 |
| | | | 22 (21) | 1993 |
| | | | 23 (22) | 412 |
| | | | 24 (23) | 48 |
| | | | 25 (24) | 989 |
| | | | 27 (25) | 5 |
| | | | 28 (26) | 139 |
| | | | missing | 0 |
| admission_source_id | No label | factor | 1 (1) | 29565 |
| | | | 2 (2) | 1104 |
| | | | 3 (3) | 187 |
| | | | 4 (4) | 3187 |
| | | | 5 (5) | 855 |
| | | | 6 (6) | 2264 |
| | | | 7 (7) | 57494 |
| | | | 8 (8) | 16 |
| | | | 9 (9) | 125 |
| | | | 10 (10) | 8 |
| | | | 11 (11) | 2 |
| | | | 13 (12) | 1 |
| | | | 14 (13) | 2 |
| | | | 17 (14) | 6781 |
| | | | 20 (15) | 161 |
| | | | 22 (16) | 12 |
| | | | 25 (17) | 2 |
| | | | missing | 0 |
| time_in_hospital | No label | integer | mean | 4 |
| | | | median | 4 |
| | | | min | 1 |
| | | | max | 14 |
| | | | missing | 0 |
| payer_code | No label | character | unique responses | 18 |
| | | | missing | 0 |
| medical_specialty | No label | character | unique responses | 73 |
| | | | missing | 0 |
| num_lab_procedures | No label | integer | mean | 43 |
| | | | median | 44 |
| | | | min | 1 |
| | | | max | 132 |
| | | | missing | 0 |
| num_procedures | No label | integer | mean | 1 |
| | | | median | 1 |
| | | | min | 0 |
| | | | max | 6 |
| | | | missing | 0 |
| num_medications | No label | integer | mean | 16 |
| | | | median | 15 |
| | | | min | 1 |
| | | | max | 81 |
| | | | missing | 0 |
| number_outpatient | No label | integer | mean | 0 |
| | | | median | 0 |
| | | | min | 0 |
| | | | max | 42 |
| | | | missing | 0 |
| number_emergency | No label | integer | mean | 0 |
| | | | median | 0 |
| | | | min | 0 |
| | | | max | 76 |
| | | | missing | 0 |
| number_inpatient | No label | integer | mean | 1 |
| | | | median | 0 |
| | | | min | 0 |
| | | | max | 21 |
| | | | missing | 0 |
| diag_1 | No label | character | unique responses | 717 |
| | | | missing | 0 |
| diag_2 | No label | character | unique responses | 749 |
| | | | missing | 0 |
| diag_3 | No label | character | unique responses | 790 |
| | | | missing | 0 |
| number_diagnoses | No label | integer | mean | 7 |
| | | | median | 8 |
| | | | min | 1 |
| | | | max | 16 |
| | | | missing | 0 |
| max_glu_serum | No label | factor | >200 (1) | 1485 |
| | | | >300 (2) | 1264 |
| | | | None (3) | 96420 |
| | | | Norm (4) | 2597 |
| | | | missing | 0 |
| A1Cresult | No label | factor | >7 (1) | 3812 |
| | | | >8 (2) | 8216 |
| | | | None (3) | 84748 |
| | | | Norm (4) | 4990 |
| | | | missing | 0 |
| metformin | No label | factor | Down (1) | 575 |
| | | | No (2) | 81778 |
| | | | Steady (3) | 18346 |
| | | | Up (4) | 1067 |
| | | | missing | 0 |
| repaglinide | No label | factor | Down (1) | 45 |
| | | | No (2) | 100227 |
| | | | Steady (3) | 1384 |
| | | | Up (4) | 110 |
| | | | missing | 0 |
| nateglinide | No label | factor | Down (1) | 11 |
| | | | No (2) | 101063 |
| | | | Steady (3) | 668 |
| | | | Up (4) | 24 |
| | | | missing | 0 |
| chlorpropamide | No label | factor | Down (1) | 1 |
| | | | No (2) | 101680 |
| | | | Steady (3) | 79 |
| | | | Up (4) | 6 |
| | | | missing | 0 |
| glimepiride | No label | factor | Down (1) | 194 |
| | | | No (2) | 96575 |
| | | | Steady (3) | 4670 |
| | | | Up (4) | 327 |
| | | | missing | 0 |
| acetohexamide | No label | factor | No (1) | 101765 |
| | | | Steady (2) | 1 |
| | | | missing | 0 |
| glipizide | No label | factor | Down (1) | 560 |
| | | | No (2) | 89080 |
| | | | Steady (3) | 11356 |
| | | | Up (4) | 770 |
| | | | missing | 0 |
| glyburide | No label | factor | Down (1) | 564 |
| | | | No (2) | 91116 |
| | | | Steady (3) | 9274 |
| | | | Up (4) | 812 |
| | | | missing | 0 |
| tolbutamide | No label | factor | No (1) | 101743 |
| | | | Steady (2) | 23 |
| | | | missing | 0 |
| pioglitazone | No label | factor | Down (1) | 118 |
| | | | No (2) | 94438 |
| | | | Steady (3) | 6976 |
| | | | Up (4) | 234 |
| | | | missing | 0 |
| rosiglitazone | No label | factor | Down (1) | 87 |
| | | | No (2) | 95401 |
| Steady (3) | 6100 | |||
| Up (4) | 178 | |||
| missing | 0 | |||
| acarbose | No label | factor | Down (1) | 3 |
| No (2) | 101458 | |||
| Steady (3) | 295 | |||
| Up (4) | 10 | |||
| missing | 0 | |||
| miglitol | No label | factor | Down (1) | 5 |
| No (2) | 101728 | |||
| Steady (3) | 31 | |||
| Up (4) | 2 | |||
| missing | 0 | |||
| troglitazone | No label | factor | No (1) | 101763 |
| Steady (2) | 3 | |||
| missing | 0 | |||
| tolazamide | No label | factor | No (1) | 101727 |
| Steady (2) | 38 | |||
| Up (3) | 1 | |||
| missing | 0 | |||
| examide | No label | factor | No (1) | 101766 |
| missing | 0 | |||
| citoglipton | No label | factor | No (1) | 101766 |
| missing | 0 | |||
| insulin | No label | factor | Down (1) | 12218 |
| No (2) | 47383 | |||
| Steady (3) | 30849 | |||
| Up (4) | 11316 | |||
| missing | 0 | |||
| glyburide.metformin | No label | factor | Down (1) | 6 |
| No (2) | 101060 | |||
| Steady (3) | 692 | |||
| Up (4) | 8 | |||
| missing | 0 | |||
| glipizide.metformin | No label | factor | No (1) | 101753 |
| Steady (2) | 13 | |||
| missing | 0 | |||
| glimepiride.pioglitazone | No label | factor | No (1) | 101765 |
| Steady (2) | 1 | |||
| missing | 0 | |||
| metformin.rosiglitazone | No label | factor | No (1) | 101764 |
| Steady (2) | 2 | |||
| missing | 0 | |||
| metformin.pioglitazone | No label | factor | No (1) | 101765 |
| Steady (2) | 1 | |||
| missing | 0 | |||
| change | No label | factor | Ch (1) | 47011 |
| No (2) | 54755 | |||
| missing | 0 | |||
| diabetesMed | No label | factor | No (1) | 23403 |
| Yes (2) | 78363 | |||
| missing | 0 | |||
| readmitted | No label | factor | <30 (1) | 11357 |
| >30 (2) | 35545 | |||
| NO (3) | 54864 | |||
| missing | 0 |
#Management of missing values: Missing values are currently listed as "?" or "Unknown/Invalid" and require conversion to NA
Diabetes.df[Diabetes.df == "?"] <- NA
Diabetes.df[Diabetes.df == "Unknown/Invalid"] <- NA
# Count rows with NA
rows_NA <- sum(rowSums(is.na(Diabetes.df)) > 0)
percent_row_NA <- percent(rows_NA/nrow(Diabetes.df))
rows_NA
## [1] 100723
percent_row_NA
## [1] "99%"
# Count columns with NA
col_NA <- sum(colSums(is.na(Diabetes.df)) > 0)
percent_col_NA <- percent(col_NA/ncol(Diabetes.df))
col_NA
## [1] 8
percent_col_NA
## [1] "16%"
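The row and column counts above show how widespread missingness is, but not which columns drive it. A minimal sketch of a per-column breakdown follows. It uses a small made-up data frame (`toy`, not the real dataset) and the same `"?"` sentinel as the report.

```r
# Toy data frame standing in for the real dataset (all values are made up).
toy <- data.frame(
  race   = c("Caucasian", "?", "Asian", "?"),
  weight = c("?", "?", "?", "[75-100)"),
  visits = c(1L, 0L, 3L, 2L),
  stringsAsFactors = FALSE
)

# Recode the sentinel string to NA, using the same pattern as the report.
toy[toy == "?"] <- NA

# Share of missing values in each column, sorted from most to least missing.
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(round(100 * missing_share, 1))
```

Sorting the shares makes it easy to see whether a few sparse columns account for most of the rows that contain at least one NA.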
mismapDiabetes <- missmap(Diabetes.df)
# Impute missing values column by column, based on column type
Diabetes.df <- lapply(Diabetes.df, function(col) {
  if (is.numeric(col) || is.integer(col)) { # Numeric or integer columns
    if (sum(!is.na(col)) > 10) {
      col[is.na(col)] <- mean(col, na.rm = TRUE) # Replace with mean
    } else {
      col[is.na(col)] <- approx(seq_along(col), col, n = length(col))[["y"]][is.na(col)] # Interpolation
    }
  } else if (is.factor(col)) { # Factor columns
    mode_val <- names(sort(-table(col)))[1] # Mode (most common value)
    col[is.na(col)] <- mode_val
  } else if (is.character(col)) { # Character columns
    col[is.na(col)] <- "NA" # Replace with the literal string "NA"
  }
  return(col) # Return the modified column
})
Diabetes.df <- as.data.frame(Diabetes.df) # Convert the list back to a dataframe
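One side effect of mean imputation is worth illustrating: assigning a mean into an integer vector coerces the vector to double, even when no element is actually missing. That would explain why count columns appear as `num` in the later `str()` output. The sketch below uses a hypothetical vector `lab_counts`. The helper `impute_int` is an illustrative assumption, not part of the original analysis, and it rounds the mean, which the report's method does not do.

```r
# Hypothetical integer column with no missing values.
lab_counts <- c(41L, 59L, 11L, 44L)
typeof(lab_counts)                      # "integer"

# Same replacement pattern as the imputation step. No element is selected,
# but the vector is still coerced to the type of the replacement value.
imputed <- lab_counts
imputed[is.na(imputed)] <- mean(imputed, na.rm = TRUE)
typeof(imputed)                         # "double"

# Illustrative type-preserving variant (an assumption, not the report's method):
# assign only when something is missing, and round the mean to an integer.
impute_int <- function(x) {
  if (anyNA(x)) x[is.na(x)] <- as.integer(round(mean(x, na.rm = TRUE)))
  x
}
typeof(impute_int(c(1L, NA, 3L)))       # "integer"
```

The values themselves are unchanged by the coercion, so the descriptive statistics are not affected. Only the storage type differs.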
# Imputing with the method above changes some of the summary statistics.
# Check the updated dataset to confirm that no NAs remain.
summary(Diabetes.df)
## encounter_id patient_nbr race
## Min. : 12522 Min. : 135 ? : 0
## 1st Qu.: 84961194 1st Qu.: 23413221 AfricanAmerican:19210
## Median :152388987 Median : 45505143 Asian : 641
## Mean :165201646 Mean : 54330401 Caucasian :78372
## 3rd Qu.:230270888 3rd Qu.: 87545950 Hispanic : 2037
## Max. :443867222 Max. :189502619 Other : 1506
##
## gender age weight
## Female :54711 Length:101766 Length:101766
## Male :47055 Class :character Class :character
## Unknown/Invalid: 0 Mode :character Mode :character
##
##
##
##
## admission_type_id discharge_disposition_id admission_source_id
## 1 :53990 1 :60234 7 :57494
## 3 :18869 3 :13954 1 :29565
## 2 :18480 6 :12902 17 : 6781
## 6 : 5291 18 : 3691 4 : 3187
## 5 : 4785 2 : 2128 6 : 2264
## 8 : 320 22 : 1993 2 : 1104
## (Other): 31 (Other): 6864 (Other): 1371
## time_in_hospital payer_code medical_specialty num_lab_procedures
## Min. : 1.000 Length:101766 Length:101766 Min. : 1.0
## 1st Qu.: 2.000 Class :character Class :character 1st Qu.: 31.0
## Median : 4.000 Mode :character Mode :character Median : 44.0
## Mean : 4.396 Mean : 43.1
## 3rd Qu.: 6.000 3rd Qu.: 57.0
## Max. :14.000 Max. :132.0
##
## num_procedures num_medications number_outpatient number_emergency
## Min. :0.00 Min. : 1.00 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.00 1st Qu.:10.00 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.00 Median :15.00 Median : 0.0000 Median : 0.0000
## Mean :1.34 Mean :16.02 Mean : 0.3694 Mean : 0.1978
## 3rd Qu.:2.00 3rd Qu.:20.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :6.00 Max. :81.00 Max. :42.0000 Max. :76.0000
##
## number_inpatient diag_1 diag_2 diag_3
## Min. : 0.0000 Length:101766 Length:101766 Length:101766
## 1st Qu.: 0.0000 Class :character Class :character Class :character
## Median : 0.0000 Mode :character Mode :character Mode :character
## Mean : 0.6356
## 3rd Qu.: 1.0000
## Max. :21.0000
##
## number_diagnoses max_glu_serum A1Cresult metformin repaglinide
## Min. : 1.000 >200: 1485 >7 : 3812 Down : 575 Down : 45
## 1st Qu.: 6.000 >300: 1264 >8 : 8216 No :81778 No :100227
## Median : 8.000 None:96420 None:84748 Steady:18346 Steady: 1384
## Mean : 7.423 Norm: 2597 Norm: 4990 Up : 1067 Up : 110
## 3rd Qu.: 9.000
## Max. :16.000
##
## nateglinide chlorpropamide glimepiride acetohexamide glipizide
## Down : 11 Down : 1 Down : 194 No :101765 Down : 560
## No :101063 No :101680 No :96575 Steady: 1 No :89080
## Steady: 668 Steady: 79 Steady: 4670 Steady:11356
## Up : 24 Up : 6 Up : 327 Up : 770
##
##
##
## glyburide tolbutamide pioglitazone rosiglitazone acarbose
## Down : 564 No :101743 Down : 118 Down : 87 Down : 3
## No :91116 Steady: 23 No :94438 No :95401 No :101458
## Steady: 9274 Steady: 6976 Steady: 6100 Steady: 295
## Up : 812 Up : 234 Up : 178 Up : 10
##
##
##
## miglitol troglitazone tolazamide examide citoglipton
## Down : 5 No :101763 No :101727 No:101766 No:101766
## No :101728 Steady: 3 Steady: 38
## Steady: 31 Up : 1
## Up : 2
##
##
##
## insulin glyburide.metformin glipizide.metformin
## Down :12218 Down : 6 No :101753
## No :47383 No :101060 Steady: 13
## Steady:30849 Steady: 692
## Up :11316 Up : 8
##
##
##
## glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone
## No :101765 No :101764 No :101765
## Steady: 1 Steady: 2 Steady: 1
##
##
##
##
##
## change diabetesMed readmitted
## Ch:47011 No :23403 <30:11357
## No:54755 Yes:78363 >30:35545
## NO :54864
##
##
##
##
str(Diabetes.df)
## 'data.frame': 101766 obs. of 50 variables:
## $ encounter_id : num 2278392 149190 64410 500364 16680 ...
## $ patient_nbr : num 8222157 55629189 86047875 82442376 42519267 ...
## $ race : Factor w/ 6 levels "?","AfricanAmerican",..: 4 4 2 4 4 4 4 4 4 4 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 1 1 1 2 2 2 2 2 1 1 ...
## $ age : chr "[0-10)" "[10-20)" "[20-30)" "[30-40)" ...
## $ weight : chr "NA" "NA" "NA" "NA" ...
## $ admission_type_id : Factor w/ 8 levels "1","2","3","4",..: 6 1 1 1 1 2 3 1 2 3 ...
## $ discharge_disposition_id: Factor w/ 26 levels "1","2","3","4",..: 24 1 1 1 1 1 1 1 1 3 ...
## $ admission_source_id : Factor w/ 17 levels "1","2","3","4",..: 1 7 7 7 7 2 2 7 4 4 ...
## $ time_in_hospital : num 1 3 2 2 1 3 4 5 13 12 ...
## $ payer_code : chr "NA" "NA" "NA" "NA" ...
## $ medical_specialty : chr "Pediatrics-Endocrinology" "NA" "NA" "NA" ...
## $ num_lab_procedures : num 41 59 11 44 51 31 70 73 68 33 ...
## $ num_procedures : num 0 0 5 1 0 6 1 0 2 3 ...
## $ num_medications : num 1 18 13 16 8 16 21 12 28 18 ...
## $ number_outpatient : num 0 0 2 0 0 0 0 0 0 0 ...
## $ number_emergency : num 0 0 0 0 0 0 0 0 0 0 ...
## $ number_inpatient : num 0 0 1 0 0 0 0 0 0 0 ...
## $ diag_1 : chr "250.83" "276" "648" "8" ...
## $ diag_2 : chr "NA" "250.01" "250" "250.43" ...
## $ diag_3 : chr "NA" "255" "V27" "403" ...
## $ number_diagnoses : num 1 9 6 7 5 9 7 8 8 8 ...
## $ max_glu_serum : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ A1Cresult : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ metformin : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
## $ repaglinide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ nateglinide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ chlorpropamide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ glimepiride : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
## $ acetohexamide : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ glipizide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 3 2 2 2 3 2 ...
## $ glyburide : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 3 2 2 ...
## $ tolbutamide : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ pioglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ rosiglitazone : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 3 ...
## $ acarbose : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ miglitol : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ troglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ tolazamide : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ examide : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
## $ citoglipton : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
## $ insulin : Factor w/ 4 levels "Down","No","Steady",..: 2 4 2 4 3 3 3 2 3 3 ...
## $ glyburide.metformin : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ glipizide.metformin : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ glimepiride.pioglitazone: Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ metformin.rosiglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ metformin.pioglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
## $ change : Factor w/ 2 levels "Ch","No": 2 1 2 1 1 2 1 2 1 1 ...
## $ diabetesMed : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ readmitted : Factor w/ 3 levels "<30",">30","NO": 3 2 3 3 3 2 3 2 3 3 ...
Descriptive statistics were computed for the numeric variables of the full dataset and of each sample. The measures were the mean, median, standard deviation, variance, range, interquartile range (IQR), skewness, and kurtosis. Although no missing values were expected after pre-processing, the function that computes these statistics also excludes missing values as a safeguard.
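Skewness and kurtosis are less familiar than the other measures, so a short sketch of how they can be computed may help. It assumes the moment-based definitions with non-excess kurtosis, under which a normal distribution has a kurtosis of about 3. The report does not state which definitions it uses, but its kurtosis values near 3 for roughly symmetric variables are consistent with them. The function names and the `visits` vector are hypothetical.

```r
# Moment-based skewness: third central moment divided by variance^(3/2).
skew_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based (non-excess) kurtosis: fourth central moment divided by variance^2.
kurt_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Hypothetical visit counts: mostly zeros with one extreme value.
visits <- c(0, 0, 0, 0, 1, 0, 2, 0, 0, 42)
skew_moment(visits)   # strongly positive: long right tail
kurt_moment(visits)   # well above 3: heavy tail

# A symmetric vector has zero skewness.
skew_moment(c(1, 2, 3, 4, 5))
```

Because both measures raise deviations to the third or fourth power, a single extreme observation can dominate them, which is relevant when comparing samples that may or may not include such observations.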
Three sampling methods were compared. First, a simple random sample of 1,000 observations was drawn. Second, a stratified sample of 900 observations was created by drawing 100 observations from each age category (0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90). Finally, a systematic sample was created by selecting every kth observation from a random starting point, where the interval k is the total number of observations (N) divided by the target sample size (n = 1000), rounded down.
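The three schemes can be sketched side by side in base R. The population below is made up (1,050 rows in three age groups of unequal size), and all object names are hypothetical. The sketch illustrates the techniques and does not reproduce the report's own code.

```r
set.seed(1)

# Toy population: 1,050 rows in three age groups of unequal size (made-up data).
pop <- data.frame(
  id  = 1:1050,
  age = rep(c("[0-10)", "[50-60)", "[70-80)"), times = c(50, 400, 600))
)
n <- 100

# 1. Simple random sample: n rows drawn without replacement.
srs <- pop[sample(nrow(pop), n), ]

# 2. Stratified sample with equal allocation: same number of rows per age group.
per_group <- 20
strat_idx <- unlist(lapply(split(seq_len(nrow(pop)), pop$age),
                           function(idx) sample(idx, per_group)))
strat <- pop[strat_idx, ]

# 3. Systematic sample: random start in 1..k, then every k-th row.
k <- floor(nrow(pop) / n)          # sampling interval
start <- sample(seq_len(k), 1)
sys_idx <- seq(from = start, to = nrow(pop), by = k)
sys <- pop[sys_idx, ]

nrow(srs)          # 100
table(strat$age)   # 20 rows in each group
nrow(sys)          # 105: exceeds n because N / n is not a whole number
```

Two design points follow from the sketch. Equal allocation gives the smallest group the same number of rows as the largest, so pooled statistics weight small groups more heavily than the population does. Rounding k down means the systematic sample can slightly exceed the target size: with N = 101,766 and n = 1,000, k is 101, and stepping from a start of 31 to the end of the data yields 1,008 rows.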
Histograms were plotted for three numeric variables: number of lab procedures, number of procedures, and number of medications. They were created for the full dataset and for each sample. Additionally, quantile-quantile (Q-Q) plots were created for the same variables to compare each against a theoretical normal distribution.
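To make explicit what a normal Q-Q plot displays, the sketch below extracts the plotted coordinates without drawing them and reconstructs the reference line. The data are simulated Poisson counts standing in for a right-skewed variable. They are an assumption for illustration and are not drawn from the dataset.

```r
set.seed(42)

# Hypothetical right-skewed counts standing in for a variable such as num_procedures.
x <- rpois(500, lambda = 1.3)

# qqnorm() can return the plotted coordinates without drawing anything:
# $x holds the theoretical normal quantiles, $y the corresponding sample values.
qq <- qqnorm(x, plot.it = FALSE)

# Correlation between theoretical and sample quantiles. Values close to 1
# suggest the points lie near a straight line.
qq_cor <- cor(qq$x, qq$y)
qq_cor

# Slope and intercept of the reference line that qqline() draws by default
# (the line through the first and third quartiles).
probs <- c(0.25, 0.75)
slope <- unname(diff(quantile(x, probs)) / diff(qnorm(probs)))
intercept <- unname(quantile(x, probs[1]) - slope * qnorm(probs[1]))
c(slope = slope, intercept = intercept)
```

Points that curve away from the reference line at one end indicate a tail that is heavier or lighter than the normal distribution would predict.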
# Compute descriptive statistics on numeric values
# Note: skewness() and kurtosis() are not base R functions; they are assumed to
# come from a package loaded earlier in the document (for example, moments).
compute_stats <- function(column, name) {
  if (is.numeric(column) || is.integer(column)) {
    data.frame(
      Variable = name,
      Mean = round(mean(column, na.rm = TRUE), 2),
      Median = round(median(column, na.rm = TRUE), 2),
      St.Deviation = round(sd(column, na.rm = TRUE), 2),
      Variance = round(var(column, na.rm = TRUE), 2),
      Range = round(diff(range(column, na.rm = TRUE)), 2),
      IQR = round(IQR(column, na.rm = TRUE), 2),
      Skewness = round(skewness(column, na.rm = TRUE), 2),
      Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
      stringsAsFactors = FALSE
    )
  } else {
    NULL # Non-numeric columns are skipped
  }
}
descriptive_stats <- do.call(
rbind,
lapply(names(Diabetes.df), function(col) compute_stats(Diabetes.df[[col]], col))
)
descriptive_stats
## Variable Mean Median St.Deviation Variance
## 1 encounter_id 165201645.62 152388987 102640295.98 1.053503e+16
## 2 patient_nbr 54330400.69 45505143 38696359.35 1.497408e+15
## 3 time_in_hospital 4.40 4 2.99 8.910000e+00
## 4 num_lab_procedures 43.10 44 19.67 3.870800e+02
## 5 num_procedures 1.34 1 1.71 2.910000e+00
## 6 num_medications 16.02 15 8.13 6.606000e+01
## 7 number_outpatient 0.37 0 1.27 1.610000e+00
## 8 number_emergency 0.20 0 0.93 8.700000e-01
## 9 number_inpatient 0.64 0 1.26 1.590000e+00
## 10 number_diagnoses 7.42 8 1.93 3.740000e+00
## Range IQR Skewness Kurtosis
## 1 443854700 145309694 0.70 2.90
## 2 189502484 64132729 0.47 2.65
## 3 13 4 1.13 3.85
## 4 131 26 -0.24 2.75
## 5 6 2 1.32 3.86
## 6 80 10 1.33 6.47
## 7 42 0 8.83 150.90
## 8 76 0 22.86 1194.63
## 9 21 1 3.61 23.72
## 10 15 3 -0.88 2.92
col = "num_lab_procedures"
if (is.numeric(Diabetes.df[[col]]) || is.integer(Diabetes.df[[col]])) {
# Histogram for numeric variables
ggplot(Diabetes.df, aes(x = .data[[col]])) +
geom_histogram(bins = 15, fill = "darkviolet", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
# Normal Q-Q plot for num_lab_procedures (full dataset)
qqnorm(Diabetes.df$num_lab_procedures, main = 'Q-Q Plot for Normality', col = "steelblue")
qqline(Diabetes.df$num_lab_procedures, col = "red", lwd = 2, lty = 2)
col = "num_procedures"
if (is.numeric(Diabetes.df[[col]]) || is.integer(Diabetes.df[[col]])) {
# Histogram for numeric variables
ggplot(Diabetes.df, aes(x = .data[[col]])) +
geom_histogram(bins = 5, fill = "blue", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
# Normal Q-Q plot for num_procedures (full dataset)
qqnorm(Diabetes.df$num_procedures, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(Diabetes.df$num_procedures, col = "red", lwd = 2, lty = 2)
col = "num_medications"
if (is.numeric(Diabetes.df[[col]]) || is.integer(Diabetes.df[[col]])) {
# Histogram for numeric variables
ggplot(Diabetes.df, aes(x = .data[[col]])) +
geom_histogram(bins = 15, fill = "darkgreen", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
# Normal Q-Q plot for num_medications (full dataset)
qqnorm(Diabetes.df$num_medications, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(Diabetes.df$num_medications, col = "red", lwd = 2, lty = 2)
# Keep the seven numeric count variables (columns 13 to 18 and 22)
Diabetes.df.num <- Diabetes.df[c(13:18,22)]
head(Diabetes.df.num)
## num_lab_procedures num_procedures num_medications number_outpatient
## 1 41 0 1 0
## 2 59 0 18 0
## 3 11 5 13 2
## 4 44 1 16 0
## 5 51 0 8 0
## 6 31 6 16 0
## number_emergency number_inpatient number_diagnoses
## 1 0 0 1
## 2 0 0 9
## 3 0 1 6
## 4 0 0 7
## 5 0 0 5
## 6 0 0 9
descriptive_stats_diabetes_num <- do.call(
rbind,
lapply(names(Diabetes.df.num), function(col) compute_stats(Diabetes.df.num[[col]], col))
)
descriptive_stats_diabetes_num
## Variable Mean Median St.Deviation Variance Range IQR Skewness
## 1 num_lab_procedures 43.10 44 19.67 387.08 131 26 -0.24
## 2 num_procedures 1.34 1 1.71 2.91 6 2 1.32
## 3 num_medications 16.02 15 8.13 66.06 80 10 1.33
## 4 number_outpatient 0.37 0 1.27 1.61 42 0 8.83
## 5 number_emergency 0.20 0 0.93 0.87 76 0 22.86
## 6 number_inpatient 0.64 0 1.26 1.59 21 1 3.61
## 7 number_diagnoses 7.42 8 1.93 3.74 15 3 -0.88
## Kurtosis
## 1 2.75
## 2 3.86
## 3 6.47
## 4 150.90
## 5 1194.63
## 6 23.72
## 7 2.92
descriptive_stats_diabetes_num %>%
kbl(caption = "Descriptive Statistics - Numeric Variables") %>%
kable_classic()
| Variable | Mean | Median | St.Deviation | Variance | Range | IQR | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| num_lab_procedures | 43.10 | 44 | 19.67 | 387.08 | 131 | 26 | -0.24 | 2.75 |
| num_procedures | 1.34 | 1 | 1.71 | 2.91 | 6 | 2 | 1.32 | 3.86 |
| num_medications | 16.02 | 15 | 8.13 | 66.06 | 80 | 10 | 1.33 | 6.47 |
| number_outpatient | 0.37 | 0 | 1.27 | 1.61 | 42 | 0 | 8.83 | 150.90 |
| number_emergency | 0.20 | 0 | 0.93 | 0.87 | 76 | 0 | 22.86 | 1194.63 |
| number_inpatient | 0.64 | 0 | 1.26 | 1.59 | 21 | 1 | 3.61 | 23.72 |
| number_diagnoses | 7.42 | 8 | 1.93 | 3.74 | 15 | 3 | -0.88 | 2.92 |
set.seed(123)
sample_1_Diabetes.df.num <- Diabetes.df.num[sample(nrow(Diabetes.df.num), 1000), ]
head(sample_1_Diabetes.df.num)
## num_lab_procedures num_procedures num_medications number_outpatient
## 51663 61 1 9 0
## 57870 64 1 20 0
## 2986 68 0 8 0
## 29925 46 5 9 0
## 95246 49 4 15 1
## 68293 47 4 30 0
## number_emergency number_inpatient number_diagnoses
## 51663 0 0 9
## 57870 0 1 9
## 2986 0 0 8
## 29925 0 0 4
## 95246 0 4 9
## 68293 0 0 9
descriptive_stats_sample_1 <- do.call(
rbind,
lapply(names(sample_1_Diabetes.df.num), function(col) compute_stats(sample_1_Diabetes.df.num[[col]], col))
)
descriptive_stats_sample_1
## Variable Mean Median St.Deviation Variance Range IQR Skewness
## 1 num_lab_procedures 43.20 44 19.87 394.78 104 26 -0.17
## 2 num_procedures 1.34 1 1.74 3.02 6 2 1.33
## 3 num_medications 15.82 15 7.91 62.56 67 10 1.20
## 4 number_outpatient 0.36 0 1.26 1.60 21 0 7.35
## 5 number_emergency 0.23 0 0.94 0.88 16 0 8.80
## 6 number_inpatient 0.70 0 1.31 1.72 15 1 3.59
## 7 number_diagnoses 7.43 8 1.86 3.46 8 3 -0.84
## Kurtosis
## 1 2.79
## 2 3.82
## 3 5.98
## 4 88.52
## 5 115.68
## 6 25.23
## 7 2.53
descriptive_stats_sample_1 %>%
kbl(caption = "Descriptive Statistics - Random Sample") %>%
kable_classic()
| Variable | Mean | Median | St.Deviation | Variance | Range | IQR | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| num_lab_procedures | 43.20 | 44 | 19.87 | 394.78 | 104 | 26 | -0.17 | 2.79 |
| num_procedures | 1.34 | 1 | 1.74 | 3.02 | 6 | 2 | 1.33 | 3.82 |
| num_medications | 15.82 | 15 | 7.91 | 62.56 | 67 | 10 | 1.20 | 5.98 |
| number_outpatient | 0.36 | 0 | 1.26 | 1.60 | 21 | 0 | 7.35 | 88.52 |
| number_emergency | 0.23 | 0 | 0.94 | 0.88 | 16 | 0 | 8.80 | 115.68 |
| number_inpatient | 0.70 | 0 | 1.31 | 1.72 | 15 | 1 | 3.59 | 25.23 |
| number_diagnoses | 7.43 | 8 | 1.86 | 3.46 | 8 | 3 | -0.84 | 2.53 |
col = "num_lab_procedures"
if (is.numeric(sample_1_Diabetes.df.num[[col]]) || is.integer(sample_1_Diabetes.df.num[[col]])) {
# Histogram for numeric variables
ggplot(sample_1_Diabetes.df.num, aes(x = .data[[col]])) +
geom_histogram(bins = 15, fill = "darkviolet", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
# Normal Q-Q plot for num_lab_procedures (random sample)
qqnorm(sample_1_Diabetes.df.num$num_lab_procedures, main = 'Q-Q Plot for Normality', col = "steelblue")
qqline(sample_1_Diabetes.df.num$num_lab_procedures, col = "red", lwd = 2, lty = 2)
col = "num_procedures"
if (is.numeric(sample_1_Diabetes.df.num[[col]]) || is.integer(sample_1_Diabetes.df.num[[col]])) {
# Histogram for numeric variables
ggplot(sample_1_Diabetes.df.num, aes(x = .data[[col]])) +
geom_histogram(bins = 5, fill = "blue", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
# Normal Q-Q plot for num_procedures (random sample)
qqnorm(sample_1_Diabetes.df.num$num_procedures, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(sample_1_Diabetes.df.num$num_procedures, col = "red", lwd = 2, lty = 2)
col = "num_medications"
if (is.numeric(sample_1_Diabetes.df.num[[col]]) || is.integer(sample_1_Diabetes.df.num[[col]])) {
# Histogram for numeric variables
ggplot(sample_1_Diabetes.df.num, aes(x = .data[[col]])) +
geom_histogram(bins = 15, fill = "darkgreen", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
# Normal Q-Q plot for num_medications (random sample)
qqnorm(sample_1_Diabetes.df.num$num_medications, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(sample_1_Diabetes.df.num$num_medications, col = "red", lwd = 2, lty = 2)
set.seed(456)
sample_2_Diabetes.df <- Diabetes.df %>%
group_by(age) %>%
sample_n(100)
head(sample_2_Diabetes.df)
## # A tibble: 6 × 50
## # Groups: age [1]
## encounter_id patient_nbr race gender age weight admission_type_id
## <dbl> <dbl> <fct> <fct> <chr> <chr> <fct>
## 1 25121856 16295499 AfricanAmerican Female [0-1… NA 3
## 2 18757914 7691157 AfricanAmerican Male [0-1… NA 3
## 3 18298074 5236596 Caucasian Male [0-1… [0-25) 6
## 4 81214752 21073491 Caucasian Male [0-1… NA 1
## 5 52344870 18844713 Caucasian Female [0-1… NA 1
## 6 59111826 2048598 Caucasian Male [0-1… NA 1
## # ℹ 43 more variables: discharge_disposition_id <fct>,
## # admission_source_id <fct>, time_in_hospital <dbl>, payer_code <chr>,
## # medical_specialty <chr>, num_lab_procedures <dbl>, num_procedures <dbl>,
## # num_medications <dbl>, number_outpatient <dbl>, number_emergency <dbl>,
## # number_inpatient <dbl>, diag_1 <chr>, diag_2 <chr>, diag_3 <chr>,
## # number_diagnoses <dbl>, max_glu_serum <fct>, A1Cresult <fct>,
## # metformin <fct>, repaglinide <fct>, nateglinide <fct>, …
descriptive_stats_sample_2 <- do.call(
rbind,
lapply(names(sample_2_Diabetes.df), function(col) compute_stats(sample_2_Diabetes.df[[col]], col))
)
descriptive_stats_sample_2
## Variable Mean Median St.Deviation Variance Range
## 1 encounter_id 1.54239e+08 145628052 102684709.86 1.054415e+16 441244556
## 2 patient_nbr 5.05900e+07 42175899 39314511.59 1.545631e+15 188223845
## 3 time_in_hospital 4.03000e+00 3 2.80 7.830000e+00 13
## 4 num_lab_procedures 4.31800e+01 45 19.08 3.640900e+02 96
## 5 num_procedures 1.01000e+00 0 1.52 2.320000e+00 6
## 6 num_medications 1.38700e+01 12 8.49 7.200000e+01 74
## 7 number_outpatient 2.70000e-01 0 0.96 9.200000e-01 15
## 8 number_emergency 2.20000e-01 0 0.86 7.500000e-01 13
## 9 number_inpatient 6.70000e-01 0 1.64 2.690000e+00 13
## 10 number_diagnoses 6.39000e+00 7 2.60 6.740000e+00 15
## IQR Skewness Kurtosis
## 1 138425463.00 0.83 3.22
## 2 66921156.00 0.62 2.77
## 3 3.00 1.29 4.35
## 4 25.00 -0.25 2.76
## 5 1.00 1.73 5.39
## 6 10.00 1.78 9.96
## 7 0.00 6.66 73.29
## 8 0.00 7.01 72.49
## 9 1.00 4.21 25.39
## 10 4.25 -0.48 2.12
descriptive_stats_sample_2 %>%
kbl(caption = "Descriptive Statistics - Stratified Sample based on Age") %>%
kable_classic()
| Variable | Mean | Median | St.Deviation | Variance | Range | IQR | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| encounter_id | 1.54239e+08 | 145628052 | 102684709.86 | 1.054415e+16 | 441244556 | 138425463.00 | 0.83 | 3.22 |
| patient_nbr | 5.05900e+07 | 42175899 | 39314511.59 | 1.545631e+15 | 188223845 | 66921156.00 | 0.62 | 2.77 |
| time_in_hospital | 4.03000e+00 | 3 | 2.80 | 7.830000e+00 | 13 | 3.00 | 1.29 | 4.35 |
| num_lab_procedures | 4.31800e+01 | 45 | 19.08 | 3.640900e+02 | 96 | 25.00 | -0.25 | 2.76 |
| num_procedures | 1.01000e+00 | 0 | 1.52 | 2.320000e+00 | 6 | 1.00 | 1.73 | 5.39 |
| num_medications | 1.38700e+01 | 12 | 8.49 | 7.200000e+01 | 74 | 10.00 | 1.78 | 9.96 |
| number_outpatient | 2.70000e-01 | 0 | 0.96 | 9.200000e-01 | 15 | 0.00 | 6.66 | 73.29 |
| number_emergency | 2.20000e-01 | 0 | 0.86 | 7.500000e-01 | 13 | 0.00 | 7.01 | 72.49 |
| number_inpatient | 6.70000e-01 | 0 | 1.64 | 2.690000e+00 | 13 | 1.00 | 4.21 | 25.39 |
| number_diagnoses | 6.39000e+00 | 7 | 2.60 | 6.740000e+00 | 15 | 4.25 | -0.48 | 2.12 |
sample_2_num <- sample_2_Diabetes.df[c(13:18,22)]
head(sample_2_num)
## # A tibble: 6 × 7
## num_lab_procedures num_procedures num_medications number_outpatient
## <dbl> <dbl> <dbl> <dbl>
## 1 21 0 4 0
## 2 39 3 19 0
## 3 37 0 5 0
## 4 35 0 4 0
## 5 42 0 5 0
## 6 42 0 10 0
## # ℹ 3 more variables: number_emergency <dbl>, number_inpatient <dbl>,
## # number_diagnoses <dbl>
descriptive_stats_sample_2.2 <- do.call(
rbind,
lapply(names(sample_2_num), function(col) compute_stats(sample_2_num[[col]], col))
)
descriptive_stats_sample_2.2
## Variable Mean Median St.Deviation Variance Range IQR Skewness
## 1 num_lab_procedures 43.18 45 19.08 364.09 96 25.00 -0.25
## 2 num_procedures 1.01 0 1.52 2.32 6 1.00 1.73
## 3 num_medications 13.87 12 8.49 72.00 74 10.00 1.78
## 4 number_outpatient 0.27 0 0.96 0.92 15 0.00 6.66
## 5 number_emergency 0.22 0 0.86 0.75 13 0.00 7.01
## 6 number_inpatient 0.67 0 1.64 2.69 13 1.00 4.21
## 7 number_diagnoses 6.39 7 2.60 6.74 15 4.25 -0.48
## Kurtosis
## 1 2.76
## 2 5.39
## 3 9.96
## 4 73.29
## 5 72.49
## 6 25.39
## 7 2.12
descriptive_stats_sample_2.2 %>%
kbl(caption = "Descriptive Statistics - Stratified Sample based on Age") %>%
kable_classic()
| Variable | Mean | Median | St.Deviation | Variance | Range | IQR | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| num_lab_procedures | 43.18 | 45 | 19.08 | 364.09 | 96 | 25.00 | -0.25 | 2.76 |
| num_procedures | 1.01 | 0 | 1.52 | 2.32 | 6 | 1.00 | 1.73 | 5.39 |
| num_medications | 13.87 | 12 | 8.49 | 72.00 | 74 | 10.00 | 1.78 | 9.96 |
| number_outpatient | 0.27 | 0 | 0.96 | 0.92 | 15 | 0.00 | 6.66 | 73.29 |
| number_emergency | 0.22 | 0 | 0.86 | 0.75 | 13 | 0.00 | 7.01 | 72.49 |
| number_inpatient | 0.67 | 0 | 1.64 | 2.69 | 13 | 1.00 | 4.21 | 25.39 |
| number_diagnoses | 6.39 | 7 | 2.60 | 6.74 | 15 | 4.25 | -0.48 | 2.12 |
col = "num_lab_procedures"
if (is.numeric(sample_2_Diabetes.df[[col]]) || is.integer(sample_2_Diabetes.df[[col]])) {
# Histogram for numeric variables
ggplot(sample_2_Diabetes.df, aes(x = .data[[col]])) +
geom_histogram(bins = 15, fill = "darkviolet", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
# Normal Q-Q plot for num_lab_procedures (stratified sample)
qqnorm(sample_2_Diabetes.df$num_lab_procedures, main = 'Q-Q Plot for Normality', col = "steelblue")
qqline(sample_2_Diabetes.df$num_lab_procedures, col = "red", lwd = 2, lty = 2)
col = "num_procedures"
if (is.numeric(sample_2_Diabetes.df[[col]]) || is.integer(sample_2_Diabetes.df[[col]])) {
# Histogram for numeric variables
ggplot(sample_2_Diabetes.df, aes(x = .data[[col]])) +
geom_histogram(bins = 5, fill = "blue", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
# Normal Q-Q plot for num_procedures (stratified sample)
qqnorm(sample_2_Diabetes.df$num_procedures, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(sample_2_Diabetes.df$num_procedures, col = "red", lwd = 2, lty = 2)
col = "num_medications"
if (is.numeric(sample_2_Diabetes.df[[col]]) || is.integer(sample_2_Diabetes.df[[col]])) {
# Histogram for numeric variables
ggplot(sample_2_Diabetes.df, aes(x = .data[[col]])) +
geom_histogram(bins = 15, fill = "darkgreen", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
# Normal Q-Q plot for num_medications (stratified sample)
qqnorm(sample_2_Diabetes.df$num_medications, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(sample_2_Diabetes.df$num_medications, col = "red", lwd = 2, lty = 2)
N <- nrow(Diabetes.df.num)
n <- 1000
k <- floor(N/n)
set.seed(123)
start <- sample(1:k, 1)
systematic_indices_diabetes <- seq(from = start, to = N, by = k)
systematic_sample_3_diabetes <- Diabetes.df.num[systematic_indices_diabetes, ]
head(systematic_sample_3_diabetes)
## num_lab_procedures num_procedures num_medications number_outpatient
## 31 87 0 18 0
## 132 34 1 19 0
## 233 48 1 10 0
## 334 3 0 13 0
## 435 41 0 9 0
## 536 32 0 6 0
## number_emergency number_inpatient number_diagnoses
## 31 0 0 9
## 132 0 0 4
## 233 0 1 5
## 334 0 0 9
## 435 0 0 5
## 536 0 0 4
descriptive_stats_sample_3 <- do.call(
rbind,
lapply(names(systematic_sample_3_diabetes), function(col) compute_stats(systematic_sample_3_diabetes[[col]], col))
)
descriptive_stats_sample_3
## Variable Mean Median St.Deviation Variance Range IQR Skewness
## 1 num_lab_procedures 43.29 44.5 19.52 381.08 101 25 -0.28
## 2 num_procedures 1.36 1.0 1.73 3.00 6 2 1.26
## 3 num_medications 16.33 15.0 8.20 67.22 67 10 1.45
## 4 number_outpatient 0.44 0.0 1.26 1.60 11 0 4.28
## 5 number_emergency 0.20 0.0 0.70 0.50 8 0 5.32
## 6 number_inpatient 0.69 0.0 1.46 2.14 15 1 3.88
## 7 number_diagnoses 7.41 8.0 1.95 3.80 8 3 -0.95
## Kurtosis
## 1 2.76
## 2 3.60
## 3 7.96
## 4 25.16
## 5 39.73
## 6 25.08
## 7 2.83
descriptive_stats_sample_3 %>%
kbl(caption = "Descriptive Statistics - Systematic Sample") %>%
kable_classic()
| Variable | Mean | Median | St.Deviation | Variance | Range | IQR | Skewness | Kurtosis |
|---|---|---|---|---|---|---|---|---|
| num_lab_procedures | 43.29 | 44.5 | 19.52 | 381.08 | 101 | 25 | -0.28 | 2.76 |
| num_procedures | 1.36 | 1.0 | 1.73 | 3.00 | 6 | 2 | 1.26 | 3.60 |
| num_medications | 16.33 | 15.0 | 8.20 | 67.22 | 67 | 10 | 1.45 | 7.96 |
| number_outpatient | 0.44 | 0.0 | 1.26 | 1.60 | 11 | 0 | 4.28 | 25.16 |
| number_emergency | 0.20 | 0.0 | 0.70 | 0.50 | 8 | 0 | 5.32 | 39.73 |
| number_inpatient | 0.69 | 0.0 | 1.46 | 2.14 | 15 | 1 | 3.88 | 25.08 |
| number_diagnoses | 7.41 | 8.0 | 1.95 | 3.80 | 8 | 3 | -0.95 | 2.83 |
col = "num_lab_procedures"
if (is.numeric(systematic_sample_3_diabetes[[col]]) || is.integer(systematic_sample_3_diabetes[[col]])) {
# Histogram for numeric variables
ggplot(systematic_sample_3_diabetes, aes(x = .data[[col]])) +
geom_histogram(bins = 15, fill = "darkviolet", color = "black") +
labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
theme_minimal()
}
# Normal Q-Q plot for num_lab_procedures (systematic sample)
qqnorm(systematic_sample_3_diabetes$num_lab_procedures, main = 'Q-Q Plot for Normality', col = "steelblue")
qqline(systematic_sample_3_diabetes$num_lab_procedures, col = "red", lwd = 2, lty = 2)
col <- "num_procedures"
# is.numeric() is also TRUE for integer columns, so a single check is enough
if (is.numeric(systematic_sample_3_diabetes[[col]])) {
  # Histogram for numeric variables
  ggplot(systematic_sample_3_diabetes, aes(x = .data[[col]])) +
    geom_histogram(bins = 5, fill = "blue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}
# Q-Q plot of the sampled values against a normal reference line
qqnorm(systematic_sample_3_diabetes$num_procedures, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(systematic_sample_3_diabetes$num_procedures, col = "red", lwd = 2, lty = 2)
col <- "num_medications"
# is.numeric() is also TRUE for integer columns, so a single check is enough
if (is.numeric(systematic_sample_3_diabetes[[col]])) {
  # Histogram for numeric variables
  ggplot(systematic_sample_3_diabetes, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkgreen", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}
# Q-Q plot of the sampled values against a normal reference line
qqnorm(systematic_sample_3_diabetes$num_medications, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(systematic_sample_3_diabetes$num_medications, col = "red", lwd = 2, lty = 2)
The measures of central tendency (mean and median) were comparable across the three sampling methods for all numeric variables. Skewness and kurtosis, however, were not comparable across the three methods for the number of outpatient visits and the number of emergency visits. This is likely due to outliers in both variables. In the full dataset, the number of outpatient visits has a mean of 0.3694 and a 3rd quartile of 0.000 but a maximum of 42, indicating at least one and possibly several outliers. Similarly, the number of emergency visits has a mean of 0.1978 and a 3rd quartile of 0.00 but a maximum of 76, also indicating at least one and possibly several outliers. Further investigation of these outliers may indicate that the sample statistics are more representative of the dataset than statistics computed on the full dataset with its outliers.
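The sensitivity of skewness and kurtosis to a handful of extreme values can be illustrated with a minimal sketch. It does not use the 130 Hospitals Diabetes data: `visits` is a hypothetical, simulated zero-heavy count variable, and `skewness_m()` and `kurtosis_m()` are illustrative moment-based helpers (kurtosis scaled so that a normal distribution is about 3), written out so that no additional package is assumed.

```r
# Moment-based skewness and kurtosis (normal distribution: kurtosis of about 3).
# These helpers are illustrative and are not taken from the original analysis.
skewness_m <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis_m <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

set.seed(42)
# Hypothetical stand-in for a visit-count variable: mostly zeros, a few large values
visits <- rnbinom(100000, size = 0.1, mu = 0.4)

# Five independent simple random samples of 1000 observations each
draws <- replicate(5, sample(visits, 1000), simplify = FALSE)

shape_by_draw <- data.frame(
  draw     = seq_along(draws),
  mean     = sapply(draws, mean),
  median   = sapply(draws, median),
  max      = sapply(draws, max),
  skewness = sapply(draws, skewness_m),
  kurtosis = sapply(draws, kurtosis_m)
)
print(round(shape_by_draw, 2))
```

In a sketch like this, the mean and median would be expected to stay close from draw to draw, while skewness and kurtosis depend on which of the rare large values happen to be selected.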
Histograms for the full dataset and the three samples show a similar shape. For example, the number of lab procedures has an approximately normal distribution with a heavy left tail in each case. This indicates that the three sampling methods provided a reasonably similar representation of the data.
A comparison of the Quantile-Quantile (Q-Q) plots for the full dataset and each of the samples also indicates similar distributions. The heavy left tail is evident in the Q-Q plot for each of the samples.
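A numeric counterpart to the visual comparison is to place selected quantiles of the full data and of a sample side by side. The sketch below is only an illustration under stated assumptions: `full` is simulated data loosely resembling a lab-procedure count (not the actual variable), and the systematic sample uses a random start with a fixed interval.

```r
set.seed(123)
# Hypothetical stand-in for num_lab_procedures in the full dataset
full <- pmin(pmax(round(rnorm(100000, mean = 43, sd = 19.5)), 1), 132)

n <- 1000
k <- floor(length(full) / n)   # sampling interval
start <- sample(k, 1)          # random start between 1 and k
systematic <- full[seq(start, by = k, length.out = n)]

# Compare selected quantiles of the full data and the sample
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
quantile_check <- data.frame(
  prob   = probs,
  full   = as.numeric(quantile(full, probs)),
  sample = as.numeric(quantile(systematic, probs))
)
quantile_check$difference <- quantile_check$sample - quantile_check$full
print(quantile_check)
```

Small differences across the quantiles would support the same reading as the Q-Q plots; large differences in the lower quantiles would suggest the sample does not reproduce the left tail.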
Use of three sampling methods (random, stratified, and systematic sampling) improved the efficiency of data analysis with the 130 Hospitals Diabetes dataset without sacrificing reliability. Similar value distributions were observed across the histograms and Q-Q plots. This could be partially due to the inherent feature selection that resulted from eliminating categorical variables and using only numeric variables for analysis. Use of numeric variables reduced the number of features from 50 to 7. Additionally, the sampling methods reduced the number of observations analyzed from over 100,000 to 1000 while yielding comparable statistics. Strategic sampling of high-dimensionality datasets is a valuable method for reducing the volume of data for model efficiency.
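The reduction from over 100,000 observations to 1000 under each design can be sketched in base R as follows. This is a minimal illustration on simulated data, not the code used for the analysis: the data frame `pop`, its `group` column (the stratification variable), and the simulated `num_medications` values are all hypothetical, and the stratified sample assumes proportional allocation.

```r
set.seed(2024)
N <- 100000
n <- 1000

# Hypothetical data: one grouping column and one numeric column
pop <- data.frame(
  group = sample(c("A", "B", "C"), N, replace = TRUE, prob = c(0.6, 0.3, 0.1)),
  num_medications = rpois(N, lambda = 16)
)

# Simple random sample: every row has the same chance of selection
random_s <- pop[sample(N, n), ]

# Stratified sample (proportional allocation): same sampling fraction per group
frac <- n / N
strat_idx <- unlist(lapply(split(seq_len(N), pop$group), function(idx) {
  idx[sample.int(length(idx), round(length(idx) * frac))]
}))
stratified_s <- pop[strat_idx, ]

# Systematic sample: random start, then every k-th row
k <- floor(N / n)
systematic_s <- pop[seq(sample(k, 1), by = k, length.out = n), ]

# Side-by-side comparison of size and central tendency
samples <- list(full = pop, random = random_s,
                stratified = stratified_s, systematic = systematic_s)
comparison <- data.frame(
  source = names(samples),
  rows   = sapply(samples, nrow),
  mean   = sapply(samples, function(d) mean(d$num_medications)),
  median = sapply(samples, function(d) median(d$num_medications)),
  row.names = NULL
)
print(comparison)
```

Because each group's allocation is rounded, the stratified sample may differ from 1000 rows by one or two.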
Healthcare datasets provide opportunities to develop machine learning models for disease or admission prediction; however, high dimensionality creates a challenge for model development and performance. Both sampling methods and feature selection algorithms help to reduce the impact of the curse of dimensionality. The 130 Hospitals Diabetes dataset was used in this project to demonstrate the utility of sampling methods in reducing dimensionality for model performance. Sampling methods are valuable options for developing optimally performing models with high-dimensionality datasets.
References
World Health Organization. (2024, November 14). Diabetes. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/diabetes