Diabetes is a serious chronic condition that occurs when the pancreas does not produce enough insulin, a hormone required to regulate blood sugar, or when the body cannot effectively use the insulin it produces (World Health Organization, 2024). The World Health Organization reports that the number of people living with diabetes worldwide rose from 200 million in 1990 to 830 million in 2022 (World Health Organization, 2024). Diabetes can also lead to serious medical complications or death (World Health Organization, 2024). Complications and disease progression can be mitigated with lifestyle changes, including but not limited to changes in diet and exercise (World Health Organization, 2024). Early prediction of diabetes and its complications could therefore benefit millions of individuals.

Sampling methods are beneficial when analyzing large, complex, high-dimensional datasets because analyzing a representative sample of observations is more computationally efficient than analyzing every observation. A variety of sampling methods exist, including the traditional methods of random, stratified, and systematic sampling. The purpose of this project is to determine the impact of random, stratified, and systematic sampling distributions on data analysis of the UCI 130 Hospitals in the US from 1999 – 2008 dataset (hereafter referred to as the 130 Hospitals Diabetes dataset).
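The three sampling methods named above can be sketched in base R as follows. This is a minimal illustration on a made-up data frame, not the project's own sampling code: the object names (`toy`, `n`, `k`), the sample size, and the class proportions are assumptions chosen for the example.

```r
# Toy data standing in for the hospital records (hypothetical values).
set.seed(123)
toy <- data.frame(
  id = 1:1000,
  readmitted = sample(c("NO", ">30", "<30"), 1000, replace = TRUE,
                      prob = c(0.54, 0.35, 0.11))  # illustrative proportions
)
n <- 100  # assumed target sample size

# Simple random sampling: every row has the same chance of selection.
random_sample <- toy[sample(nrow(toy), n), ]

# Stratified sampling: draw the same fraction from each readmission class,
# so class proportions in the sample track those in the full data.
frac <- n / nrow(toy)
strata <- split(toy, toy$readmitted)
stratified_sample <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), round(nrow(s) * frac)), , drop = FALSE]
}))

# Systematic sampling: pick a random start, then take every k-th row.
k <- nrow(toy) %/% n
start <- sample(k, 1)
systematic_sample <- toy[seq(start, by = k, length.out = n), ]
```

Because each stratum's size is rounded separately, the stratified sample can differ from `n` by a row or two. Systematic sampling assumes the row order is unrelated to the outcome; if the file is sorted in a meaningful way, the rows should be shuffled first.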

#Set working directory
getwd()
## [1] "C:/Users/benke/OneDrive/NU/8521/Assignment 3"
setwd("C:/Users/benke/Downloads")

The 130 Hospitals Diabetes dataset (Clore et al., 2014) was downloaded from the UCI Machine Learning Repository for analysis. RStudio was the integrated development environment (IDE) used for the analysis. An R Markdown document was created and knitted to an HTML file for review.

#Load dataset
Diabetes.df <- read.csv("diabetic_data.csv")
#Load necessary libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(Amelia)
## Warning: package 'Amelia' was built under R version 4.5.1
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.3, built: 2024-11-07)
## ## Copyright (C) 2005-2025 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(ggplot2)      
library(scales)             
library(moments)
library(readr)
## Warning: package 'readr' was built under R version 4.5.1
## 
## Attaching package: 'readr'
## The following object is masked from 'package:scales':
## 
##     col_factor
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.1
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(tidyr)
library(plyr)
## Warning: package 'plyr' was built under R version 4.5.1
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
library(rsample)
## 
## Attaching package: 'rsample'
## The following object is masked from 'package:Rcpp':
## 
##     populate
library(bestNormalize)
## Warning: package 'bestNormalize' was built under R version 4.5.1
library(stats)
library(yardstick)
## 
## Attaching package: 'yardstick'
## The following object is masked from 'package:readr':
## 
##     spec
library(skimr)
library(stringr)
library(caret)
## Warning: package 'caret' was built under R version 4.5.1
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
install.packages("datadictionary", repos = "https://cloud.r-project.org")
## Installing package into 'C:/Users/benke/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)
## package 'datadictionary' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\benke\AppData\Local\Temp\Rtmp0uhMHt\downloaded_packages
library(datadictionary)
## Warning: package 'datadictionary' was built under R version 4.5.2
#Completion of a preliminary inspection of the data
skim(Diabetes.df)
Data summary
Name Diabetes.df
Number of rows 101766
Number of columns 50
_______________________
Column type frequency:
character 37
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
race 0 1 1 15 0 6 0
gender 0 1 4 15 0 3 0
age 0 1 6 8 0 10 0
weight 0 1 1 9 0 10 0
payer_code 0 1 1 2 0 18 0
medical_specialty 0 1 1 36 0 73 0
diag_1 0 1 1 6 0 717 0
diag_2 0 1 1 6 0 749 0
diag_3 0 1 1 6 0 790 0
max_glu_serum 0 1 4 4 0 4 0
A1Cresult 0 1 2 4 0 4 0
metformin 0 1 2 6 0 4 0
repaglinide 0 1 2 6 0 4 0
nateglinide 0 1 2 6 0 4 0
chlorpropamide 0 1 2 6 0 4 0
glimepiride 0 1 2 6 0 4 0
acetohexamide 0 1 2 6 0 2 0
glipizide 0 1 2 6 0 4 0
glyburide 0 1 2 6 0 4 0
tolbutamide 0 1 2 6 0 2 0
pioglitazone 0 1 2 6 0 4 0
rosiglitazone 0 1 2 6 0 4 0
acarbose 0 1 2 6 0 4 0
miglitol 0 1 2 6 0 4 0
troglitazone 0 1 2 6 0 2 0
tolazamide 0 1 2 6 0 3 0
examide 0 1 2 2 0 1 0
citoglipton 0 1 2 2 0 1 0
insulin 0 1 2 6 0 4 0
glyburide.metformin 0 1 2 6 0 4 0
glipizide.metformin 0 1 2 6 0 2 0
glimepiride.pioglitazone 0 1 2 6 0 2 0
metformin.rosiglitazone 0 1 2 6 0 2 0
metformin.pioglitazone 0 1 2 6 0 2 0
change 0 1 2 2 0 2 0
diabetesMed 0 1 2 3 0 2 0
readmitted 0 1 2 3 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
encounter_id 0 1 165201645.62 102640295.98 12522 84961194 152388987 230270888 443867222 ▆▇▅▂▂
patient_nbr 0 1 54330400.69 38696359.35 135 23413221 45505143 87545950 189502619 ▇▆▆▁▁
admission_type_id 0 1 2.02 1.45 1 1 1 3 8 ▇▂▁▁▁
discharge_disposition_id 0 1 3.72 5.28 1 1 1 4 28 ▇▁▁▁▁
admission_source_id 0 1 5.75 4.06 1 1 7 7 25 ▅▇▁▁▁
time_in_hospital 0 1 4.40 2.99 1 2 4 6 14 ▇▅▂▁▁
num_lab_procedures 0 1 43.10 19.67 1 31 44 57 132 ▃▇▅▁▁
num_procedures 0 1 1.34 1.71 0 0 1 2 6 ▇▂▁▁▁
num_medications 0 1 16.02 8.13 1 10 15 20 81 ▇▃▁▁▁
number_outpatient 0 1 0.37 1.27 0 0 0 0 42 ▇▁▁▁▁
number_emergency 0 1 0.20 0.93 0 0 0 0 76 ▇▁▁▁▁
number_inpatient 0 1 0.64 1.26 0 0 0 1 21 ▇▁▁▁▁
number_diagnoses 0 1 7.42 1.93 1 6 8 9 16 ▁▅▇▁▁

The dataset consists of 101,766 observations and 50 variables. Although continuous numerical variables were expected to be easy to identify among the 50 columns, inspection showed otherwise. Of the 13 columns read as numeric, only the length of hospital stay and the counts of medical occurrences, such as lab procedures or medications, held meaningful integer values; the others were identifiers or coded categories. The remaining 37 variables were read as character data and were categorical.
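One way to check which integer columns are counts and which are coded categories is to tally storage classes and count distinct values per numeric column. The sketch below uses a small made-up data frame; the column names mirror the dataset, but the values and the helper objects (`class_counts`, `distinct_values`) are assumptions for illustration.

```r
# Small stand-in data frame (column names from the dataset, values made up).
toy <- data.frame(
  admission_type_id = c(1L, 2L, 1L, 3L),
  num_medications   = c(18L, 13L, 16L, 8L),
  race              = c("Caucasian", "AfricanAmerican", "?", "Caucasian"),
  stringsAsFactors  = FALSE
)

# Tally columns by storage class.
class_counts <- table(vapply(toy, function(col) class(col)[1], character(1)))

# Count distinct values in each numeric column. A short list of integer
# codes suggests a categorical variable stored as a number.
numeric_cols <- names(toy)[vapply(toy, is.numeric, logical(1))]
distinct_values <- vapply(toy[numeric_cols],
                          function(col) length(unique(col)),
                          integer(1))
```

A low distinct-value count is only a hint. Whether a column is a code or a count still has to be confirmed against the dataset documentation.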

The dataset includes six variables that identify or describe the patient: the encounter ID, the patient number, race, gender, age, and weight. Fifteen variables included information about the medical course, including but not limited to referral source, payer codes, and diagnosis codes. Additionally, several variables were related to the number of medical occurrences, including lab procedures, medications, outpatient visits, emergency room visits, inpatient admissions, and diagnoses. Two variables were related to lab results (glucose serum and A1C result). Twenty-three variables were related to medication course, each with up to four levels (down, no, steady, and up). The last three variables were related to change in medical status, whether diabetes was medicated, and whether the patient was readmitted. A full data dictionary is available in the knitted R Markdown HTML file.

#Further inspection of the data
glimpse(Diabetes.df)
## Rows: 101,766
## Columns: 50
## $ encounter_id             <int> 2278392, 149190, 64410, 500364, 16680, 35754,…
## $ patient_nbr              <int> 8222157, 55629189, 86047875, 82442376, 425192…
## $ race                     <chr> "Caucasian", "Caucasian", "AfricanAmerican", …
## $ gender                   <chr> "Female", "Female", "Female", "Male", "Male",…
## $ age                      <chr> "[0-10)", "[10-20)", "[20-30)", "[30-40)", "[…
## $ weight                   <chr> "?", "?", "?", "?", "?", "?", "?", "?", "?", …
## $ admission_type_id        <int> 6, 1, 1, 1, 1, 2, 3, 1, 2, 3, 1, 2, 1, 1, 3, …
## $ discharge_disposition_id <int> 25, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 3, 6, 1,…
## $ admission_source_id      <int> 1, 7, 7, 7, 7, 2, 2, 7, 4, 4, 7, 4, 7, 7, 2, …
## $ time_in_hospital         <int> 1, 3, 2, 2, 1, 3, 4, 5, 13, 12, 9, 7, 7, 10, …
## $ payer_code               <chr> "?", "?", "?", "?", "?", "?", "?", "?", "?", …
## $ medical_specialty        <chr> "Pediatrics-Endocrinology", "?", "?", "?", "?…
## $ num_lab_procedures       <int> 41, 59, 11, 44, 51, 31, 70, 73, 68, 33, 47, 6…
## $ num_procedures           <int> 0, 0, 5, 1, 0, 6, 1, 0, 2, 3, 2, 0, 0, 1, 5, …
## $ num_medications          <int> 1, 18, 13, 16, 8, 16, 21, 12, 28, 18, 17, 11,…
## $ number_outpatient        <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ number_emergency         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ number_inpatient         <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ diag_1                   <chr> "250.83", "276", "648", "8", "197", "414", "4…
## $ diag_2                   <chr> "?", "250.01", "250", "250.43", "157", "411",…
## $ diag_3                   <chr> "?", "255", "V27", "403", "250", "250", "V45"…
## $ number_diagnoses         <int> 1, 9, 6, 7, 5, 9, 7, 8, 8, 8, 9, 7, 8, 8, 8, …
## $ max_glu_serum            <chr> "None", "None", "None", "None", "None", "None…
## $ A1Cresult                <chr> "None", "None", "None", "None", "None", "None…
## $ metformin                <chr> "No", "No", "No", "No", "No", "No", "Steady",…
## $ repaglinide              <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ nateglinide              <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ chlorpropamide           <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ glimepiride              <chr> "No", "No", "No", "No", "No", "No", "Steady",…
## $ acetohexamide            <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ glipizide                <chr> "No", "No", "Steady", "No", "Steady", "No", "…
## $ glyburide                <chr> "No", "No", "No", "No", "No", "No", "No", "St…
## $ tolbutamide              <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ pioglitazone             <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ rosiglitazone            <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ acarbose                 <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ miglitol                 <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ troglitazone             <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ tolazamide               <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ examide                  <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ citoglipton              <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ insulin                  <chr> "No", "Up", "No", "Up", "Steady", "Steady", "…
## $ glyburide.metformin      <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ glipizide.metformin      <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ glimepiride.pioglitazone <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ metformin.rosiglitazone  <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ metformin.pioglitazone   <chr> "No", "No", "No", "No", "No", "No", "No", "No…
## $ change                   <chr> "No", "Ch", "No", "Ch", "Ch", "No", "Ch", "No…
## $ diabetesMed              <chr> "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes…
## $ readmitted               <chr> "NO", ">30", "NO", "NO", "NO", ">30", "NO", "…
#Open the data viewer only in an interactive session (skipped when knitting)
if (interactive()) View(Diabetes.df)
#Further data inspection
head(Diabetes.df)
##   encounter_id patient_nbr            race gender     age weight
## 1      2278392     8222157       Caucasian Female  [0-10)      ?
## 2       149190    55629189       Caucasian Female [10-20)      ?
## 3        64410    86047875 AfricanAmerican Female [20-30)      ?
## 4       500364    82442376       Caucasian   Male [30-40)      ?
## 5        16680    42519267       Caucasian   Male [40-50)      ?
## 6        35754    82637451       Caucasian   Male [50-60)      ?
##   admission_type_id discharge_disposition_id admission_source_id
## 1                 6                       25                   1
## 2                 1                        1                   7
## 3                 1                        1                   7
## 4                 1                        1                   7
## 5                 1                        1                   7
## 6                 2                        1                   2
##   time_in_hospital payer_code        medical_specialty num_lab_procedures
## 1                1          ? Pediatrics-Endocrinology                 41
## 2                3          ?                        ?                 59
## 3                2          ?                        ?                 11
## 4                2          ?                        ?                 44
## 5                1          ?                        ?                 51
## 6                3          ?                        ?                 31
##   num_procedures num_medications number_outpatient number_emergency
## 1              0               1                 0                0
## 2              0              18                 0                0
## 3              5              13                 2                0
## 4              1              16                 0                0
## 5              0               8                 0                0
## 6              6              16                 0                0
##   number_inpatient diag_1 diag_2 diag_3 number_diagnoses max_glu_serum
## 1                0 250.83      ?      ?                1          None
## 2                0    276 250.01    255                9          None
## 3                1    648    250    V27                6          None
## 4                0      8 250.43    403                7          None
## 5                0    197    157    250                5          None
## 6                0    414    411    250                9          None
##   A1Cresult metformin repaglinide nateglinide chlorpropamide glimepiride
## 1      None        No          No          No             No          No
## 2      None        No          No          No             No          No
## 3      None        No          No          No             No          No
## 4      None        No          No          No             No          No
## 5      None        No          No          No             No          No
## 6      None        No          No          No             No          No
##   acetohexamide glipizide glyburide tolbutamide pioglitazone rosiglitazone
## 1            No        No        No          No           No            No
## 2            No        No        No          No           No            No
## 3            No    Steady        No          No           No            No
## 4            No        No        No          No           No            No
## 5            No    Steady        No          No           No            No
## 6            No        No        No          No           No            No
##   acarbose miglitol troglitazone tolazamide examide citoglipton insulin
## 1       No       No           No         No      No          No      No
## 2       No       No           No         No      No          No      Up
## 3       No       No           No         No      No          No      No
## 4       No       No           No         No      No          No      Up
## 5       No       No           No         No      No          No  Steady
## 6       No       No           No         No      No          No  Steady
##   glyburide.metformin glipizide.metformin glimepiride.pioglitazone
## 1                  No                  No                       No
## 2                  No                  No                       No
## 3                  No                  No                       No
## 4                  No                  No                       No
## 5                  No                  No                       No
## 6                  No                  No                       No
##   metformin.rosiglitazone metformin.pioglitazone change diabetesMed readmitted
## 1                      No                     No     No          No         NO
## 2                      No                     No     Ch         Yes        >30
## 3                      No                     No     No         Yes         NO
## 4                      No                     No     Ch         Yes         NO
## 5                      No                     No     Ch         Yes         NO
## 6                      No                     No     No         Yes        >30
#Reclassification of numeric categorical variables to factors
Diabetes.df$admission_type_id <- as.factor(Diabetes.df$admission_type_id)
Diabetes.df$discharge_disposition_id <- as.factor(Diabetes.df$discharge_disposition_id)
Diabetes.df$admission_source_id <- as.factor(Diabetes.df$admission_source_id)

Upon inspection, several values were found to be coded as missing with a “?” or “Unknown/Invalid” entry. Missing values were handled with mean imputation for numerical variables and mode imputation for categorical variables. Categorical variables were also reclassified as factors.
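The missing-value handling described above can be sketched in base R as follows. This is an illustrative sketch on made-up rows, not the project's own code: the helper names (`stat_mode`, `impute_column`) and the `NA` placed in the numeric column are assumptions added for the example.

```r
# Made-up rows using the placeholder codes described above.
toy <- data.frame(
  race            = c("Caucasian", "?", "AfricanAmerican", "Caucasian"),
  gender          = c("Female", "Male", "Unknown/Invalid", "Female"),
  num_medications = c(18, NA, 13, 8),  # NA added here for illustration only
  stringsAsFactors = FALSE
)

# Step 1: recode the placeholder strings as NA in character columns.
missing_codes <- c("?", "Unknown/Invalid")
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], function(col) {
  col[col %in% missing_codes] <- NA
  col
})

# Hypothetical helper: most frequent non-missing value (table() drops NA).
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Step 2: mean imputation for numeric columns, mode imputation otherwise.
impute_column <- function(col) {
  if (is.numeric(col)) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
  } else {
    col[is.na(col)] <- stat_mode(col)
  }
  col
}
imputed <- as.data.frame(lapply(toy, impute_column), stringsAsFactors = FALSE)

# Step 3: reclassify the categorical columns as factors.
imputed[is_chr] <- lapply(imputed[is_chr], as.factor)
```

When two categories tie for most frequent, `which.max` returns the first one in sorted order, so the mode chosen for a tie is arbitrary.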

#Change additional categories to factors
factor_columns_diabetes <- c(3, 4, 7, 23:50)
Diabetes.df[factor_columns_diabetes] <- lapply(Diabetes.df[factor_columns_diabetes], function(col) as.factor(as.character(col)))
#Creation of a data dictionary
dict.Diabetes.df <- create_dictionary(Diabetes.df)
## Warning in factor_summary(dataset, column): discharge_disposition_id has more
## than 10 levels, did you want a character variable?
## Warning in factor_summary(dataset, column): admission_source_id has more than
## 10 levels, did you want a character variable?
Diabetes.labels <- c(
  encounter_id = "Patient encounter ID",
  patient_nbr = "Patient ID number",
  race = "Race",
  gender = "Gender",
  age = "Age",
  weight = "Weight",
  admission_type_id = "Admission Type",
  discharge_disposition_id = "Disposition at Discharge",
  admission_source_id = "Source of Admission",
  time_in_hospital = "Time in Hospital",
  payer_code = "Payer Code",
  medical_specialty = "Medical Specialty",
  num_lab_procedures = "Number of Lab Procedures",
  num_procedures = "Number of Procedures",
  num_medications = "Number of Medications",
  number_outpatient = "Number of Outpatient Visits",
  number_emergency = "Number of Emergency Room Visits",
  number_inpatient = "Number of Inpatient Admissions",
  diag_1 = "Diagnosis 1",
  diag_2 = "Diagnosis 2",
  diag_3 = "Diagnosis 3",
  number_diagnoses = "Number of Diagnoses",
  max_glu_serum = "Maximum Glucose",
  A1Cresult = "Result of A1C",
  metformin = "Medication yes/no: Metformin",
  repaglinide = "Medication yes/no: Repaglinide",
  nateglinide = "Medication yes/no: Nateglinide",
  chlorpropamide = "Medication yes/no: Chlorpropamide",
  glimepiride = "Medication yes/no: Glimepiride",
  acetohexamide = "Medication yes/no: Acetohexamide",
  glipizide = "Medication yes/no: Glipizide",
  glyburide = "Medication yes/no: Glyburide",
  tolbutamide = "Medication yes/no: Tolbutamide",
  pioglitazone = "Medication yes/no: Pioglitazone",
  rosiglitazone = "Medication yes/no: Rosiglitazone",
  acarbose = "Medication yes/no: Acarbose",
  miglitol = "Medication yes/no: Miglitol",
  troglitazone = "Medication yes/no: Troglitazone",
  tolazamide = "Medication yes/no: Tolazamide",
  examide = "Medication yes/no: Examide",
  citoglipton = "Medication yes/no: Citoglipton",
  insulin = "Insulin Use",
  glyburide.metformin = "Medication Combination: Glyburide-Metformin yes/no",
  glipizide.metformin = "Medication Combination: Glipizide-Metformin yes/no",
  glimepiride.pioglitazone = "Medication Combination: Glimepiride-Pioglitazone yes/no",
  metformin.rosiglitazone = "Medication Combination: Metformin-Rosiglitazone yes/no",
  metformin.pioglitazone = "Medication Combination: Metformin-Pioglitazone yes/no",
  change = "Change in Medical Status",
  diabetesMed = "Medicated for Diabetes",
  readmitted = "Readmitted")
create_dictionary(Diabetes.df, var_labels = Diabetes.labels)
## Warning in factor_summary(dataset, column): discharge_disposition_id has more
## than 10 levels, did you want a character variable?
## Warning in factor_summary(dataset, column): admission_source_id has more than
## 10 levels, did you want a character variable?
##                         item
## 1                           
## 2                           
## 3               encounter_id
## 4                           
## 5                           
## 6                           
## 7                           
## 8                patient_nbr
## 9                           
## 10                          
## 11                          
## 12                          
## 13                      race
## 14                          
## 15                          
## 16                          
## 17                          
## 18                          
## 19                          
## 20                    gender
## 21                          
## 22                          
## 23                          
## 24                       age
## 25                          
## 26                    weight
## 27                          
## 28         admission_type_id
## 29                          
## 30                          
## 31                          
## 32                          
## 33                          
## 34                          
## 35                          
## 36                          
## 37  discharge_disposition_id
## 38                          
## 39                          
## 40                          
## 41                          
## 42                          
## 43                          
## 44                          
## 45                          
## 46                          
## 47                          
## 48                          
## 49                          
## 50                          
## 51                          
## 52                          
## 53                          
## 54                          
## 55                          
## 56                          
## 57                          
## 58                          
## 59                          
## 60                          
## 61                          
## 62                          
## 63                          
## 64       admission_source_id
## 65                          
## 66                          
## 67                          
## 68                          
## 69                          
## 70                          
## 71                          
## 72                          
## 73                          
## 74                          
## 75                          
## 76                          
## 77                          
## 78                          
## 79                          
## 80                          
## 81                          
## 82          time_in_hospital
## 83                          
## 84                          
## 85                          
## 86                          
## 87                payer_code
## 88                          
## 89         medical_specialty
## 90                          
## 91        num_lab_procedures
## 92                          
## 93                          
## 94                          
## 95                          
## 96            num_procedures
## 97                          
## 98                          
## 99                          
## 100                         
## 101          num_medications
## 102                         
## 103                         
## 104                         
## 105                         
## 106        number_outpatient
## 107                         
## 108                         
## 109                         
## 110                         
## 111         number_emergency
## 112                         
## 113                         
## 114                         
## 115                         
## 116         number_inpatient
## 117                         
## 118                         
## 119                         
## 120                         
## 121                   diag_1
## 122                         
## 123                   diag_2
## 124                         
## 125                   diag_3
## 126                         
## 127         number_diagnoses
## 128                         
## 129                         
## 130                         
## 131                         
## 132            max_glu_serum
## 133                         
## 134                         
## 135                         
## 136                         
## 137                A1Cresult
## 138                         
## 139                         
## 140                         
## 141                         
## 142                metformin
## 143                         
## 144                         
## 145                         
## 146                         
## 147              repaglinide
## 148                         
## 149                         
## 150                         
## 151                         
## 152              nateglinide
## 153                         
## 154                         
## 155                         
## 156                         
## 157           chlorpropamide
## 158                         
## 159                         
## 160                         
## 161                         
## 162              glimepiride
## 163                         
## 164                         
## 165                         
## 166                         
## 167            acetohexamide
## 168                         
## 169                         
## 170                glipizide
## 171                         
## 172                         
## 173                         
## 174                         
## 175                glyburide
## 176                         
## 177                         
## 178                         
## 179                         
## 180              tolbutamide
## 181                         
## 182                         
## 183             pioglitazone
## 184                         
## 185                         
## 186                         
## 187                         
## 188            rosiglitazone
## 189                         
## 190                         
## 191                         
## 192                         
## 193                 acarbose
## 194                         
## 195                         
## 196                         
## 197                         
## 198                 miglitol
## 199                         
## 200                         
## 201                         
## 202                         
## 203             troglitazone
## 204                         
## 205                         
## 206               tolazamide
## 207                         
## 208                         
## 209                         
## 210                  examide
## 211                         
## 212              citoglipton
## 213                         
## 214                  insulin
## 215                         
## 216                         
## 217                         
## 218                         
## 219      glyburide.metformin
## 220                         
## 221                         
## 222                         
## 223                         
## 224      glipizide.metformin
## 225                         
## 226                         
## 227 glimepiride.pioglitazone
## 228                         
## 229                         
## 230  metformin.rosiglitazone
## 231                         
## 232                         
## 233   metformin.pioglitazone
## 234                         
## 235                         
## 236                   change
## 237                         
## 238                         
## 239              diabetesMed
## 240                         
## 241                         
## 242               readmitted
## 243                         
## 244                         
## 245                         
##                                                       label     class
## 1                                                                    
## 2                                                                    
## 3                                      Patient encounter ID   integer
## 4                                                                    
## 5                                                                    
## 6                                                                    
## 7                                                                    
## 8                                         Patient ID number   integer
## 9                                                                    
## 10                                                                   
## 11                                                                   
## 12                                                                   
## 13                                                     Race    factor
## 14                                                                   
## 15                                                                   
## 16                                                                   
## 17                                                                   
## 18                                                                   
## 19                                                                   
## 20                                                   Gender    factor
## 21                                                                   
## 22                                                                   
## 23                                                                   
## 24                                                      Age character
## 25                                                                   
## 26                                                   Weight character
## 27                                                                   
## 28                                           Admission Type    factor
## 29                                                                   
## 30                                                                   
## 31                                                                   
## 32                                                                   
## 33                                                                   
## 34                                                                   
## 35                                                                   
## 36                                                                   
## 37                                 Disposition at Discharge    factor
## 38                                                                   
## 39                                                                   
## 40                                                                   
## 41                                                                   
## 42                                                                   
## 43                                                                   
## 44                                                                   
## 45                                                                   
## 46                                                                   
## 47                                                                   
## 48                                                                   
## 49                                                                   
## 50                                                                   
## 51                                                                   
## 52                                                                   
## 53                                                                   
## 54                                                                   
## 55                                                                   
## 56                                                                   
## 57                                                                   
## 58                                                                   
## 59                                                                   
## 60                                                                   
## 61                                                                   
## 62                                                                   
## 63                                                                   
## 64                                      Source of Admission    factor
## 65                                                                   
## 66                                                                   
## 67                                                                   
## 68                                                                   
## 69                                                                   
## 70                                                                   
## 71                                                                   
## 72                                                                   
## 73                                                                   
## 74                                                                   
## 75                                                                   
## 76                                                                   
## 77                                                                   
## 78                                                                   
## 79                                                                   
## 80                                                                   
## 81                                                                   
## 82                                         Time in Hospital   integer
## 83                                                                   
## 84                                                                   
## 85                                                                   
## 86                                                                   
## 87                                               Payer Code character
## 88                                                                   
## 89                                        Medical Specialty character
## 90                                                                   
## 91                                 Number of Lab Procedures   integer
## 92                                                                   
## 93                                                                   
## 94                                                                   
## 95                                                                   
## 96                                     Number of Procedures   integer
## 97                                                                   
## 98                                                                   
## 99                                                                   
## 100                                                                  
## 101                                   Number of Medications   integer
## 102                                                                  
## 103                                                                  
## 104                                                                  
## 105                                                                  
## 106                             Number of Outpatient Visits   integer
## 107                                                                  
## 108                                                                  
## 109                                                                  
## 110                                                                  
## 111                         Number of Emergency Room Visits   integer
## 112                                                                  
## 113                                                                  
## 114                                                                  
## 115                                                                  
## 116                          Number of Inpatient Admissions   integer
## 117                                                                  
## 118                                                                  
## 119                                                                  
## 120                                                                  
## 121                                             Diagnosis 1 character
## 122                                                                  
## 123                                             Diagnosis 2 character
## 124                                                                  
## 125                                             Diagnosis 3 character
## 126                                                                  
## 127                                     Number of Diagnoses   integer
## 128                                                                  
## 129                                                                  
## 130                                                                  
## 131                                                                  
## 132                                         Maximum Glucose    factor
## 133                                                                  
## 134                                                                  
## 135                                                                  
## 136                                                                  
## 137                                           Result or A1C    factor
## 138                                                                  
## 139                                                                  
## 140                                                                  
## 141                                                                  
## 142                            Medication yes/no: Metformin    factor
## 143                                                                  
## 144                                                                  
## 145                                                                  
## 146                                                                  
## 147                          Medication yes/no: Repaglinide    factor
## 148                                                                  
## 149                                                                  
## 150                                                                  
## 151                                                                  
## 152                          Medication yes/no: Nateglinide    factor
## 153                                                                  
## 154                                                                  
## 155                                                                  
## 156                                                                  
## 157                       Medication yes/no: Chlorpropamide    factor
## 158                                                                  
## 159                                                                  
## 160                                                                  
## 161                                                                  
## 162                          Medication yes/no: Glimepiride    factor
## 163                                                                  
## 164                                                                  
## 165                                                                  
## 166                                                                  
## 167                        Medication yes/no: Acetohexamide    factor
## 168                                                                  
## 169                                                                  
## 170                            Medication yes/no: Glipizide    factor
## 171                                                                  
## 172                                                                  
## 173                                                                  
## 174                                                                  
## 175                        Medication yes/no: Acetohexamide    factor
## 176                                                                  
## 177                                                                  
## 178                                                                  
## 179                                                                  
## 180                            Medication yes/no: Glyburide    factor
## 181                                                                  
## 182                                                                  
## 183                          Medication yes/no: Tolbutamide    factor
## 184                                                                  
## 185                                                                  
## 186                                                                  
## 187                                                                  
## 188                         Medication yes/no: Pioglitazone    factor
## 189                                                                  
## 190                                                                  
## 191                                                                  
## 192                                                                  
## 193                        Medication yes/no: Rosiglitazone    factor
## 194                                                                  
## 195                                                                  
## 196                                                                  
## 197                                                                  
## 198                             Medication yes/no: Acarbose    factor
## 199                                                                  
## 200                                                                  
## 201                                                                  
## 202                                                                  
## 203                             Medication yes/no: Miglitol    factor
## 204                                                                  
## 205                                                                  
## 206                         Medication yes/no: Troglitazone    factor
## 207                                                                  
## 208                                                                  
## 209                                                                  
## 210                              Medication yes/no: Examide    factor
## 211                                                                  
## 212                          Medication yes/no: Citoglipton    factor
## 213                                                                  
## 214                                             Insulin Use    factor
## 215                                                                  
## 216                                                                  
## 217                                                                  
## 218                                                                  
## 219      Medication Combination: Glyburide-Metformin yes/no    factor
## 220                                                                  
## 221                                                                  
## 222                                                                  
## 223                                                                  
## 224      Medication Combination: Glipizide-Metformin yes/no    factor
## 225                                                                  
## 226                                                                  
## 227 Medication Combination: Glimepiride-Pioglitazone yes/no    factor
## 228                                                                  
## 229                                                                  
## 230  Medication Combination: Metformin-Rosiglitazone yes/no    factor
## 231                                                                  
## 232                                                                  
## 233   Medication Combination: Metformin-Pioglitazone yes/no    factor
## 234                                                                  
## 235                                                                  
## 236                                Change in Medical Status    factor
## 237                                                                  
## 238                                                                  
## 239                                  Medicated for Diabetes    factor
## 240                                                                  
## 241                                                                  
## 242                                              Readmitted    factor
## 243                                                                  
## 244                                                                  
## 245                                                                  
##                 summary     value
## 1       Rows in dataset    101766
## 2    Columns in dataset        50
## 3                  mean 165201646
## 4                median 152388987
## 5                   min     12522
## 6                   max 443867222
## 7               missing         0
## 8                  mean  54330401
## 9                median  45505143
## 10                  min       135
## 11                  max 189502619
## 12              missing         0
## 13                ? (1)      2273
## 14  AfricanAmerican (2)     19210
## 15            Asian (3)       641
## 16        Caucasian (4)     76099
## 17         Hispanic (5)      2037
## 18            Other (6)      1506
## 19              missing         0
## 20           Female (1)     54708
## 21             Male (2)     47055
## 22  Unknown/Invalid (3)         3
## 23              missing         0
## 24     unique responses        10
## 25              missing         0
## 26     unique responses        10
## 27              missing         0
## 28                1 (1)     53990
## 29                2 (2)     18480
## 30                3 (3)     18869
## 31                4 (4)        10
## 32                5 (5)      4785
## 33                6 (6)      5291
## 34                7 (7)        21
## 35                8 (8)       320
## 36              missing         0
## 37                1 (1)     60234
## 38                2 (2)      2128
## 39                3 (3)     13954
## 40                4 (4)       815
## 41                5 (5)      1184
## 42                6 (6)     12902
## 43                7 (7)       623
## 44                8 (8)       108
## 45                9 (9)        21
## 46              10 (10)         6
## 47              11 (11)      1642
## 48              12 (12)         3
## 49              13 (13)       399
## 50              14 (14)       372
## 51              15 (15)        63
## 52              16 (16)        11
## 53              17 (17)        14
## 54              18 (18)      3691
## 55              19 (19)         8
## 56              20 (20)         2
## 57              22 (21)      1993
## 58              23 (22)       412
## 59              24 (23)        48
## 60              25 (24)       989
## 61              27 (25)         5
## 62              28 (26)       139
## 63              missing         0
## 64                1 (1)     29565
## 65                2 (2)      1104
## 66                3 (3)       187
## 67                4 (4)      3187
## 68                5 (5)       855
## 69                6 (6)      2264
## 70                7 (7)     57494
## 71                8 (8)        16
## 72                9 (9)       125
## 73              10 (10)         8
## 74              11 (11)         2
## 75              13 (12)         1
## 76              14 (13)         2
## 77              17 (14)      6781
## 78              20 (15)       161
## 79              22 (16)        12
## 80              25 (17)         2
## 81              missing         0
## 82                 mean         4
## 83               median         4
## 84                  min         1
## 85                  max        14
## 86              missing         0
## 87     unique responses        18
## 88              missing         0
## 89     unique responses        73
## 90              missing         0
## 91                 mean        43
## 92               median        44
## 93                  min         1
## 94                  max       132
## 95              missing         0
## 96                 mean         1
## 97               median         1
## 98                  min         0
## 99                  max         6
## 100             missing         0
## 101                mean        16
## 102              median        15
## 103                 min         1
## 104                 max        81
## 105             missing         0
## 106                mean         0
## 107              median         0
## 108                 min         0
## 109                 max        42
## 110             missing         0
## 111                mean         0
## 112              median         0
## 113                 min         0
## 114                 max        76
## 115             missing         0
## 116                mean         1
## 117              median         0
## 118                 min         0
## 119                 max        21
## 120             missing         0
## 121    unique responses       717
## 122             missing         0
## 123    unique responses       749
## 124             missing         0
## 125    unique responses       790
## 126             missing         0
## 127                mean         7
## 128              median         8
## 129                 min         1
## 130                 max        16
## 131             missing         0
## 132            >200 (1)      1485
## 133            >300 (2)      1264
## 134            None (3)     96420
## 135            Norm (4)      2597
## 136             missing         0
## 137              >7 (1)      3812
## 138              >8 (2)      8216
## 139            None (3)     84748
## 140            Norm (4)      4990
## 141             missing         0
## 142            Down (1)       575
## 143              No (2)     81778
## 144          Steady (3)     18346
## 145              Up (4)      1067
## 146             missing         0
## 147            Down (1)        45
## 148              No (2)    100227
## 149          Steady (3)      1384
## 150              Up (4)       110
## 151             missing         0
## 152            Down (1)        11
## 153              No (2)    101063
## 154          Steady (3)       668
## 155              Up (4)        24
## 156             missing         0
## 157            Down (1)         1
## 158              No (2)    101680
## 159          Steady (3)        79
## 160              Up (4)         6
## 161             missing         0
## 162            Down (1)       194
## 163              No (2)     96575
## 164          Steady (3)      4670
## 165              Up (4)       327
## 166             missing         0
## 167              No (1)    101765
## 168          Steady (2)         1
## 169             missing         0
## 170            Down (1)       560
## 171              No (2)     89080
## 172          Steady (3)     11356
## 173              Up (4)       770
## 174             missing         0
## 175            Down (1)       564
## 176              No (2)     91116
## 177          Steady (3)      9274
## 178              Up (4)       812
## 179             missing         0
## 180              No (1)    101743
## 181          Steady (2)        23
## 182             missing         0
## 183            Down (1)       118
## 184              No (2)     94438
## 185          Steady (3)      6976
## 186              Up (4)       234
## 187             missing         0
## 188            Down (1)        87
## 189              No (2)     95401
## 190          Steady (3)      6100
## 191              Up (4)       178
## 192             missing         0
## 193            Down (1)         3
## 194              No (2)    101458
## 195          Steady (3)       295
## 196              Up (4)        10
## 197             missing         0
## 198            Down (1)         5
## 199              No (2)    101728
## 200          Steady (3)        31
## 201              Up (4)         2
## 202             missing         0
## 203              No (1)    101763
## 204          Steady (2)         3
## 205             missing         0
## 206              No (1)    101727
## 207          Steady (2)        38
## 208              Up (3)         1
## 209             missing         0
## 210              No (1)    101766
## 211             missing         0
## 212              No (1)    101766
## 213             missing         0
## 214            Down (1)     12218
## 215              No (2)     47383
## 216          Steady (3)     30849
## 217              Up (4)     11316
## 218             missing         0
## 219            Down (1)         6
## 220              No (2)    101060
## 221          Steady (3)       692
## 222              Up (4)         8
## 223             missing         0
## 224              No (1)    101753
## 225          Steady (2)        13
## 226             missing         0
## 227              No (1)    101765
## 228          Steady (2)         1
## 229             missing         0
## 230              No (1)    101764
## 231          Steady (2)         2
## 232             missing         0
## 233              No (1)    101765
## 234          Steady (2)         1
## 235             missing         0
## 236              Ch (1)     47011
## 237              No (2)     54755
## 238             missing         0
## 239              No (1)     23403
## 240             Yes (2)     78363
## 241             missing         0
## 242             <30 (1)     11357
## 243             >30 (2)     35545
## 244              NO (3)     54864
## 245             missing         0
kable(dict.Diabetes.df, caption = "Data Dictionary")
Data Dictionary
item label class summary value
Rows in dataset 101766
Columns in dataset 50
encounter_id No label integer mean 165201646
median 152388987
min 12522
max 443867222
missing 0
patient_nbr No label integer mean 54330401
median 45505143
min 135
max 189502619
missing 0
race No label factor ? (1) 2273
AfricanAmerican (2) 19210
Asian (3) 641
Caucasian (4) 76099
Hispanic (5) 2037
Other (6) 1506
missing 0
gender No label factor Female (1) 54708
Male (2) 47055
Unknown/Invalid (3) 3
missing 0
age No label character unique responses 10
missing 0
weight No label character unique responses 10
missing 0
admission_type_id No label factor 1 (1) 53990
2 (2) 18480
3 (3) 18869
4 (4) 10
5 (5) 4785
6 (6) 5291
7 (7) 21
8 (8) 320
missing 0
discharge_disposition_id No label factor 1 (1) 60234
2 (2) 2128
3 (3) 13954
4 (4) 815
5 (5) 1184
6 (6) 12902
7 (7) 623
8 (8) 108
9 (9) 21
10 (10) 6
11 (11) 1642
12 (12) 3
13 (13) 399
14 (14) 372
15 (15) 63
16 (16) 11
17 (17) 14
18 (18) 3691
19 (19) 8
20 (20) 2
22 (21) 1993
23 (22) 412
24 (23) 48
25 (24) 989
27 (25) 5
28 (26) 139
missing 0
admission_source_id No label factor 1 (1) 29565
2 (2) 1104
3 (3) 187
4 (4) 3187
5 (5) 855
6 (6) 2264
7 (7) 57494
8 (8) 16
9 (9) 125
10 (10) 8
11 (11) 2
13 (12) 1
14 (13) 2
17 (14) 6781
20 (15) 161
22 (16) 12
25 (17) 2
missing 0
time_in_hospital No label integer mean 4
median 4
min 1
max 14
missing 0
payer_code No label character unique responses 18
missing 0
medical_specialty No label character unique responses 73
missing 0
num_lab_procedures No label integer mean 43
median 44
min 1
max 132
missing 0
num_procedures No label integer mean 1
median 1
min 0
max 6
missing 0
num_medications No label integer mean 16
median 15
min 1
max 81
missing 0
number_outpatient No label integer mean 0
median 0
min 0
max 42
missing 0
number_emergency No label integer mean 0
median 0
min 0
max 76
missing 0
number_inpatient No label integer mean 1
median 0
min 0
max 21
missing 0
diag_1 No label character unique responses 717
missing 0
diag_2 No label character unique responses 749
missing 0
diag_3 No label character unique responses 790
missing 0
number_diagnoses No label integer mean 7
median 8
min 1
max 16
missing 0
max_glu_serum No label factor >200 (1) 1485
>300 (2) 1264
None (3) 96420
Norm (4) 2597
missing 0
A1Cresult No label factor >7 (1) 3812
>8 (2) 8216
None (3) 84748
Norm (4) 4990
missing 0
metformin No label factor Down (1) 575
No (2) 81778
Steady (3) 18346
Up (4) 1067
missing 0
repaglinide No label factor Down (1) 45
No (2) 100227
Steady (3) 1384
Up (4) 110
missing 0
nateglinide No label factor Down (1) 11
No (2) 101063
Steady (3) 668
Up (4) 24
missing 0
chlorpropamide No label factor Down (1) 1
No (2) 101680
Steady (3) 79
Up (4) 6
missing 0
glimepiride No label factor Down (1) 194
No (2) 96575
Steady (3) 4670
Up (4) 327
missing 0
acetohexamide No label factor No (1) 101765
Steady (2) 1
missing 0
glipizide No label factor Down (1) 560
No (2) 89080
Steady (3) 11356
Up (4) 770
missing 0
glyburide No label factor Down (1) 564
No (2) 91116
Steady (3) 9274
Up (4) 812
missing 0
tolbutamide No label factor No (1) 101743
Steady (2) 23
missing 0
pioglitazone No label factor Down (1) 118
No (2) 94438
Steady (3) 6976
Up (4) 234
missing 0
rosiglitazone No label factor Down (1) 87
No (2) 95401
Steady (3) 6100
Up (4) 178
missing 0
acarbose No label factor Down (1) 3
No (2) 101458
Steady (3) 295
Up (4) 10
missing 0
miglitol No label factor Down (1) 5
No (2) 101728
Steady (3) 31
Up (4) 2
missing 0
troglitazone No label factor No (1) 101763
Steady (2) 3
missing 0
tolazamide No label factor No (1) 101727
Steady (2) 38
Up (3) 1
missing 0
examide No label factor No (1) 101766
missing 0
citoglipton No label factor No (1) 101766
missing 0
insulin No label factor Down (1) 12218
No (2) 47383
Steady (3) 30849
Up (4) 11316
missing 0
glyburide.metformin No label factor Down (1) 6
No (2) 101060
Steady (3) 692
Up (4) 8
missing 0
glipizide.metformin No label factor No (1) 101753
Steady (2) 13
missing 0
glimepiride.pioglitazone No label factor No (1) 101765
Steady (2) 1
missing 0
metformin.rosiglitazone No label factor No (1) 101764
Steady (2) 2
missing 0
metformin.pioglitazone No label factor No (1) 101765
Steady (2) 1
missing 0
change No label factor Ch (1) 47011
No (2) 54755
missing 0
diabetesMed No label factor No (1) 23403
Yes (2) 78363
missing 0
readmitted No label factor <30 (1) 11357
>30 (2) 35545
NO (3) 54864
missing 0
#Management of missing values: Missing values are currently listed as "?" or "Unknown/Invalid" and require conversion to NA
Diabetes.df[Diabetes.df == "?"] <- NA
Diabetes.df[Diabetes.df == "Unknown/Invalid"] <- NA
# Count rows with NA
rows_NA <- sum(rowSums(is.na(Diabetes.df)) > 0)
percent_row_NA <- percent(rows_NA/nrow(Diabetes.df))
rows_NA
## [1] 100723
percent_row_NA
## [1] "99%"
# Count columns with NA
col_NA <- sum(colSums(is.na(Diabetes.df)) > 0)
percent_col_NA <- percent(col_NA/ncol(Diabetes.df))
col_NA
## [1] 8
percent_col_NA
## [1] "16%"
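
Because only 8 of the 50 columns contain NA values while 99% of rows are affected, a per-column breakdown may show which variables account for most of the missingness. The following is a minimal sketch on a hypothetical toy data frame (the `toy` object and its values are made up for illustration and are not drawn from the dataset):

```r
# Hypothetical toy data; column names mirror the dataset but values are made up
toy <- data.frame(
  weight = c(NA, NA, "[75-100)", NA),
  race = c("Caucasian", NA, "Asian", "Other"),
  time_in_hospital = c(1L, 3L, 4L, 2L)
)

# Count and proportion of NA values in each column
na_count <- colSums(is.na(toy))
na_share <- round(na_count / nrow(toy), 2)

# Keep only columns with at least one NA, most affected first
na_summary <- sort(na_share[na_count > 0], decreasing = TRUE)
na_summary
```

The same three statements could be applied to Diabetes.df in place of `toy` to rank its columns by share of missing values.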
mismapDiabetes <- missmap(Diabetes.df)

# Impute missing values column by column, based on each column's type
Diabetes.df <- lapply(Diabetes.df, function(col) {
  if (is.numeric(col)) { # Numeric columns (is.numeric() is also TRUE for integers)
    if (sum(!is.na(col)) > 10) {
      col[is.na(col)] <- mean(col, na.rm = TRUE) # Replace with the column mean
    } else {
      # Linear interpolation at the positions of the missing rows; rule = 2 carries
      # the nearest observed value to leading/trailing NAs (needs >= 2 non-NA values)
      col[is.na(col)] <- approx(seq_along(col), col, xout = which(is.na(col)), rule = 2)[["y"]]
    }
  } else if (is.factor(col)) { # Factor columns
    mode_val <- names(sort(-table(col)))[1] # Mode (most common level)
    col[is.na(col)] <- mode_val
  } else if (is.character(col)) { # Character columns
    col[is.na(col)] <- "NA" # Replace with the literal string "NA"
  }
  return(col) # Return the modified column
})

Diabetes.df <- as.data.frame(Diabetes.df) # Convert the list back to a dataframe
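
Recoding a placeholder such as "?" to NA and then imputing with the mode leaves the placeholder as an empty factor level, which is why levels with a count of 0 can still appear in later summaries. A minimal sketch of this behavior, using a hypothetical toy factor (the `race` vector below is made up for illustration), with `droplevels()` as one possible way to remove the unused level:

```r
# Hypothetical toy factor; values are made up for illustration
race <- factor(c("Caucasian", "?", "Asian", "Caucasian"))

# Recode the placeholder to NA; the "?" level itself is kept
race[race == "?"] <- NA

# Impute with the mode (most frequent observed level)
mode_val <- names(which.max(table(race)))
race[is.na(race)] <- mode_val

table(race)                     # "?" is still listed, with a count of 0
race_clean <- droplevels(race)  # drop levels that no longer occur
levels(race_clean)
```

Whether to drop the empty levels is a design choice: keeping them preserves the original coding scheme, while dropping them avoids zero-count categories in tables and stratified samples.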


# Note: imputing with the method above changes some of the summary statistics.
# For example, the 2273 "?" values in race are reassigned to the mode, Caucasian (76099 -> 78372).
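
Because missing character values are replaced with the literal string "NA", a check based on `is.na()` alone will not detect them. The sketch below, on a hypothetical toy data frame (`toy` and its values are made up for illustration), shows one way to confirm that no true NA values remain while still counting the placeholder strings:

```r
# Hypothetical toy data standing in for the imputed data frame
toy <- data.frame(
  weight = c("NA", "[75-100)", "NA"),
  num_medications = c(16, 15, 20),
  stringsAsFactors = FALSE
)

# TRUE only if a real NA value is present anywhere in the data frame
anyNA(toy)

# Count literal "NA" strings per column (only character columns are checked)
is_chr <- vapply(toy, is.character, logical(1))
na_strings <- colSums(toy[, is_chr, drop = FALSE] == "NA")
na_strings
```

Columns with a high count of "NA" strings are still effectively missing and may need separate handling before sampling.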


# Check the updated dataset and ensure no remaining NAs
summary(Diabetes.df)
##   encounter_id        patient_nbr                     race      
##  Min.   :    12522   Min.   :      135   ?              :    0  
##  1st Qu.: 84961194   1st Qu.: 23413221   AfricanAmerican:19210  
##  Median :152388987   Median : 45505143   Asian          :  641  
##  Mean   :165201646   Mean   : 54330401   Caucasian      :78372  
##  3rd Qu.:230270888   3rd Qu.: 87545950   Hispanic       : 2037  
##  Max.   :443867222   Max.   :189502619   Other          : 1506  
##                                                                 
##              gender          age               weight         
##  Female         :54711   Length:101766      Length:101766     
##  Male           :47055   Class :character   Class :character  
##  Unknown/Invalid:    0   Mode  :character   Mode  :character  
##                                                               
##                                                               
##                                                               
##                                                               
##  admission_type_id discharge_disposition_id admission_source_id
##  1      :53990     1      :60234            7      :57494      
##  3      :18869     3      :13954            1      :29565      
##  2      :18480     6      :12902            17     : 6781      
##  6      : 5291     18     : 3691            4      : 3187      
##  5      : 4785     2      : 2128            6      : 2264      
##  8      :  320     22     : 1993            2      : 1104      
##  (Other):   31     (Other): 6864            (Other): 1371      
##  time_in_hospital  payer_code        medical_specialty  num_lab_procedures
##  Min.   : 1.000   Length:101766      Length:101766      Min.   :  1.0     
##  1st Qu.: 2.000   Class :character   Class :character   1st Qu.: 31.0     
##  Median : 4.000   Mode  :character   Mode  :character   Median : 44.0     
##  Mean   : 4.396                                         Mean   : 43.1     
##  3rd Qu.: 6.000                                         3rd Qu.: 57.0     
##  Max.   :14.000                                         Max.   :132.0     
##                                                                           
##  num_procedures num_medications number_outpatient number_emergency 
##  Min.   :0.00   Min.   : 1.00   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.:0.00   1st Qu.:10.00   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median :1.00   Median :15.00   Median : 0.0000   Median : 0.0000  
##  Mean   :1.34   Mean   :16.02   Mean   : 0.3694   Mean   : 0.1978  
##  3rd Qu.:2.00   3rd Qu.:20.00   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
##  Max.   :6.00   Max.   :81.00   Max.   :42.0000   Max.   :76.0000  
##                                                                    
##  number_inpatient     diag_1             diag_2             diag_3         
##  Min.   : 0.0000   Length:101766      Length:101766      Length:101766     
##  1st Qu.: 0.0000   Class :character   Class :character   Class :character  
##  Median : 0.0000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 0.6356                                                           
##  3rd Qu.: 1.0000                                                           
##  Max.   :21.0000                                                           
##                                                                            
##  number_diagnoses max_glu_serum A1Cresult     metformin     repaglinide    
##  Min.   : 1.000   >200: 1485    >7  : 3812   Down  :  575   Down  :    45  
##  1st Qu.: 6.000   >300: 1264    >8  : 8216   No    :81778   No    :100227  
##  Median : 8.000   None:96420    None:84748   Steady:18346   Steady:  1384  
##  Mean   : 7.423   Norm: 2597    Norm: 4990   Up    : 1067   Up    :   110  
##  3rd Qu.: 9.000                                                            
##  Max.   :16.000                                                            
##                                                                            
##  nateglinide     chlorpropamide  glimepiride    acetohexamide    glipizide    
##  Down  :    11   Down  :     1   Down  :  194   No    :101765   Down  :  560  
##  No    :101063   No    :101680   No    :96575   Steady:     1   No    :89080  
##  Steady:   668   Steady:    79   Steady: 4670                   Steady:11356  
##  Up    :    24   Up    :     6   Up    :  327                   Up    :  770  
##                                                                               
##                                                                               
##                                                                               
##   glyburide     tolbutamide     pioglitazone   rosiglitazone    acarbose     
##  Down  :  564   No    :101743   Down  :  118   Down  :   87   Down  :     3  
##  No    :91116   Steady:    23   No    :94438   No    :95401   No    :101458  
##  Steady: 9274                   Steady: 6976   Steady: 6100   Steady:   295  
##  Up    :  812                   Up    :  234   Up    :  178   Up    :    10  
##                                                                              
##                                                                              
##                                                                              
##    miglitol      troglitazone     tolazamide     examide     citoglipton
##  Down  :     5   No    :101763   No    :101727   No:101766   No:101766  
##  No    :101728   Steady:     3   Steady:    38                          
##  Steady:    31                   Up    :     1                          
##  Up    :     2                                                          
##                                                                         
##                                                                         
##                                                                         
##    insulin      glyburide.metformin glipizide.metformin
##  Down  :12218   Down  :     6       No    :101753      
##  No    :47383   No    :101060       Steady:    13      
##  Steady:30849   Steady:   692                          
##  Up    :11316   Up    :     8                          
##                                                        
##                                                        
##                                                        
##  glimepiride.pioglitazone metformin.rosiglitazone metformin.pioglitazone
##  No    :101765            No    :101764           No    :101765         
##  Steady:     1            Steady:     2           Steady:     1         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  change     diabetesMed readmitted 
##  Ch:47011   No :23403   <30:11357  
##  No:54755   Yes:78363   >30:35545  
##                         NO :54864  
##                                    
##                                    
##                                    
## 
str(Diabetes.df)
## 'data.frame':    101766 obs. of  50 variables:
##  $ encounter_id            : num  2278392 149190 64410 500364 16680 ...
##  $ patient_nbr             : num  8222157 55629189 86047875 82442376 42519267 ...
##  $ race                    : Factor w/ 6 levels "?","AfricanAmerican",..: 4 4 2 4 4 4 4 4 4 4 ...
##  $ gender                  : Factor w/ 3 levels "Female","Male",..: 1 1 1 2 2 2 2 2 1 1 ...
##  $ age                     : chr  "[0-10)" "[10-20)" "[20-30)" "[30-40)" ...
##  $ weight                  : chr  "NA" "NA" "NA" "NA" ...
##  $ admission_type_id       : Factor w/ 8 levels "1","2","3","4",..: 6 1 1 1 1 2 3 1 2 3 ...
##  $ discharge_disposition_id: Factor w/ 26 levels "1","2","3","4",..: 24 1 1 1 1 1 1 1 1 3 ...
##  $ admission_source_id     : Factor w/ 17 levels "1","2","3","4",..: 1 7 7 7 7 2 2 7 4 4 ...
##  $ time_in_hospital        : num  1 3 2 2 1 3 4 5 13 12 ...
##  $ payer_code              : chr  "NA" "NA" "NA" "NA" ...
##  $ medical_specialty       : chr  "Pediatrics-Endocrinology" "NA" "NA" "NA" ...
##  $ num_lab_procedures      : num  41 59 11 44 51 31 70 73 68 33 ...
##  $ num_procedures          : num  0 0 5 1 0 6 1 0 2 3 ...
##  $ num_medications         : num  1 18 13 16 8 16 21 12 28 18 ...
##  $ number_outpatient       : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ number_emergency        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ number_inpatient        : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ diag_1                  : chr  "250.83" "276" "648" "8" ...
##  $ diag_2                  : chr  "NA" "250.01" "250" "250.43" ...
##  $ diag_3                  : chr  "NA" "255" "V27" "403" ...
##  $ number_diagnoses        : num  1 9 6 7 5 9 7 8 8 8 ...
##  $ max_glu_serum           : Factor w/ 4 levels ">200",">300",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1Cresult               : Factor w/ 4 levels ">7",">8","None",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ metformin               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
##  $ repaglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ nateglinide             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ chlorpropamide          : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ glimepiride             : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 3 2 2 2 ...
##  $ acetohexamide           : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ glipizide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 3 2 3 2 2 2 3 2 ...
##  $ glyburide               : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 3 2 2 ...
##  $ tolbutamide             : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pioglitazone            : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ rosiglitazone           : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 3 ...
##  $ acarbose                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ miglitol                : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ troglitazone            : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ tolazamide              : Factor w/ 3 levels "No","Steady",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ examide                 : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ citoglipton             : Factor w/ 1 level "No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ insulin                 : Factor w/ 4 levels "Down","No","Steady",..: 2 4 2 4 3 3 3 2 3 3 ...
##  $ glyburide.metformin     : Factor w/ 4 levels "Down","No","Steady",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ glipizide.metformin     : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ glimepiride.pioglitazone: Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metformin.rosiglitazone : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metformin.pioglitazone  : Factor w/ 2 levels "No","Steady": 1 1 1 1 1 1 1 1 1 1 ...
##  $ change                  : Factor w/ 2 levels "Ch","No": 2 1 2 1 1 2 1 2 1 1 ...
##  $ diabetesMed             : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ readmitted              : Factor w/ 3 levels "<30",">30","NO": 3 2 3 3 3 2 3 2 3 3 ...

Descriptive statistics were computed for the numeric variables of the full dataset and of each of the three samples. The measures were the mean, median, standard deviation, variance, range, interquartile range (IQR), skewness, and kurtosis. Although no missing values were anticipated after pre-processing, the function excludes missing values from each calculation as a further safeguard.
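
As a minimal sketch of how the last two measures can be read, the block below computes moment-based skewness and kurtosis by hand in base R. It assumes the population-moment definitions (as used, for example, by the moments package); the report does not show which package supplies skewness() and kurtosis(), so that choice is an assumption. The helper names and the simulated data are hypothetical and are not part of the analysis.

```r
# Moment-based skewness: third central moment over the second central moment^(3/2).
# Assumed definition; the report does not state which package it uses.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m3 <- mean((x - mean(x))^3)
  m3 / m2^1.5
}

# Moment-based (non-excess) kurtosis: fourth central moment over the second squared.
moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)
  m4 <- mean((x - mean(x))^4)
  m4 / m2^2
}

set.seed(1)
normal_draws <- rnorm(10000)  # illustrative data only, not from the dataset
skewed_draws <- rexp(10000)   # illustrative right-skewed data

round(c(skewness = moment_skewness(normal_draws),
        kurtosis = moment_kurtosis(normal_draws)), 2)
round(c(skewness = moment_skewness(skewed_draws),
        kurtosis = moment_kurtosis(skewed_draws)), 2)
```

Under this definition a normal distribution has skewness close to 0 and kurtosis close to 3, which gives a reference point for reading the reported values.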

Three sampling methods were compared. The first sample was a simple random sample of 1000 observations. The second was a stratified sample of 900 observations, with 100 observations drawn from each age category (0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90). The third was a systematic sample that takes every kth observation from a random starting point, where k is the total number of observations (N) divided by the target sample size (n = 1000), rounded down.
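
The arithmetic behind the systematic sample can be sketched as follows. N is the row count reported by str() above, and the starting index of 31 is read from the first row name of the systematic sample shown later in the output; the variable names here are illustrative only.

```r
N <- 101766        # rows in the full dataset, as reported by str()
n <- 1000          # target sample size
k <- floor(N / n)  # sampling interval, rounded down

start <- 31        # starting index taken from the systematic sample's first row name
idx <- seq(from = start, to = N, by = k)

k            # sampling interval
length(idx)  # number of observations actually selected
head(idx)    # first few selected row indices
```

Because k is rounded down to 101, the sequence runs slightly past the target and selects 1008 rows rather than exactly 1000.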

Data were visualized with histograms of three numeric variables: number of lab procedures, number of procedures, and number of medications. Histograms were created for the full dataset and for each sample. Additionally, quantile-quantile (Q-Q) plots were created for the same variables to compare their distributions with a theoretical normal distribution.
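
A Q-Q plot is read by eye, but a rough numeric companion can be sketched in base R: the correlation between the sample quantiles and the theoretical normal quantiles that qqnorm() computes. This is an illustrative addition rather than a step in the analysis; the helper name qq_correlation and the simulated data are hypothetical.

```r
# Correlation between theoretical normal quantiles and sample quantiles.
# Values near 1 mean the points lie close to a straight line on the Q-Q plot.
qq_correlation <- function(x) {
  x <- x[!is.na(x)]
  qq <- qqnorm(x, plot.it = FALSE)  # returns theoretical (x) and sample (y) quantiles
  cor(qq$x, qq$y)
}

set.seed(1)
roughly_normal <- rnorm(1000, mean = 43, sd = 20)  # illustrative data only
right_skewed   <- rexp(1000, rate = 1)             # illustrative data only

qq_correlation(roughly_normal)
qq_correlation(right_skewed)
```

The right-skewed data should give a visibly lower correlation than the roughly normal data, mirroring the curvature seen on a Q-Q plot.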

# Compute descriptive statistics on numeric values
compute_stats <- function(column, name) {
  if (is.numeric(column) || is.integer(column)) {
    data.frame(
      Variable = name,
      Mean = round(mean(column, na.rm = TRUE), 2),
      Median = round(median(column, na.rm = TRUE), 2),
      St.Deviation = round(sd(column, na.rm = TRUE), 2),
      Variance = round(var(column, na.rm = TRUE), 2),
      Range = round(diff(range(column, na.rm = TRUE)), 2),
      IQR = round(IQR(column, na.rm = TRUE), 2),
      Skewness = round(skewness(column, na.rm = TRUE), 2),
      Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
      stringsAsFactors = FALSE
    )
  } else {
    NULL
  }
}

descriptive_stats <- do.call(
  rbind,
  lapply(names(Diabetes.df), function(col) compute_stats(Diabetes.df[[col]], col))
)

descriptive_stats
##              Variable         Mean    Median St.Deviation     Variance
## 1        encounter_id 165201645.62 152388987 102640295.98 1.053503e+16
## 2         patient_nbr  54330400.69  45505143  38696359.35 1.497408e+15
## 3    time_in_hospital         4.40         4         2.99 8.910000e+00
## 4  num_lab_procedures        43.10        44        19.67 3.870800e+02
## 5      num_procedures         1.34         1         1.71 2.910000e+00
## 6     num_medications        16.02        15         8.13 6.606000e+01
## 7   number_outpatient         0.37         0         1.27 1.610000e+00
## 8    number_emergency         0.20         0         0.93 8.700000e-01
## 9    number_inpatient         0.64         0         1.26 1.590000e+00
## 10   number_diagnoses         7.42         8         1.93 3.740000e+00
##        Range       IQR Skewness Kurtosis
## 1  443854700 145309694     0.70     2.90
## 2  189502484  64132729     0.47     2.65
## 3         13         4     1.13     3.85
## 4        131        26    -0.24     2.75
## 5          6         2     1.32     3.86
## 6         80        10     1.33     6.47
## 7         42         0     8.83   150.90
## 8         76         0    22.86  1194.63
## 9         21         1     3.61    23.72
## 10        15         3    -0.88     2.92
col = "num_lab_procedures"
if (is.numeric(Diabetes.df[[col]]) || is.integer(Diabetes.df[[col]])) {
  # Histogram for numeric variables
  ggplot(Diabetes.df, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkviolet", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Q-Q plot of num_lab_procedures against the normal distribution (full dataset)
qqnorm(Diabetes.df$num_lab_procedures, main = 'Q-Q Plot for Normality', col = "steelblue")
qqline(Diabetes.df$num_lab_procedures, col = "red", lwd = 2, lty = 2)

col = "num_procedures"
if (is.numeric(Diabetes.df[[col]]) || is.integer(Diabetes.df[[col]])) {
  # Histogram for numeric variables
  ggplot(Diabetes.df, aes(x = .data[[col]])) +
    geom_histogram(bins = 5, fill = "blue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Q-Q plot of num_procedures against the normal distribution (full dataset)
qqnorm(Diabetes.df$num_procedures, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(Diabetes.df$num_procedures, col = "red", lwd = 2, lty = 2)

col = "num_medications"
if (is.numeric(Diabetes.df[[col]]) || is.integer(Diabetes.df[[col]])) {
  # Histogram for numeric variables
  ggplot(Diabetes.df, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkgreen", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Q-Q plot of num_medications against the normal distribution (full dataset)
qqnorm(Diabetes.df$num_medications, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(Diabetes.df$num_medications, col = "red", lwd = 2, lty = 2)

# Keep the seven numeric count variables (columns 13 to 18 and 22)
Diabetes.df.num <- Diabetes.df[c(13:18, 22)]
head(Diabetes.df.num)
##   num_lab_procedures num_procedures num_medications number_outpatient
## 1                 41              0               1                 0
## 2                 59              0              18                 0
## 3                 11              5              13                 2
## 4                 44              1              16                 0
## 5                 51              0               8                 0
## 6                 31              6              16                 0
##   number_emergency number_inpatient number_diagnoses
## 1                0                0                1
## 2                0                0                9
## 3                0                1                6
## 4                0                0                7
## 5                0                0                5
## 6                0                0                9
descriptive_stats_diabetes_num <- do.call(
  rbind,
  lapply(names(Diabetes.df.num), function(col) compute_stats(Diabetes.df.num[[col]], col))
)

descriptive_stats_diabetes_num
##             Variable  Mean Median St.Deviation Variance Range IQR Skewness
## 1 num_lab_procedures 43.10     44        19.67   387.08   131  26    -0.24
## 2     num_procedures  1.34      1         1.71     2.91     6   2     1.32
## 3    num_medications 16.02     15         8.13    66.06    80  10     1.33
## 4  number_outpatient  0.37      0         1.27     1.61    42   0     8.83
## 5   number_emergency  0.20      0         0.93     0.87    76   0    22.86
## 6   number_inpatient  0.64      0         1.26     1.59    21   1     3.61
## 7   number_diagnoses  7.42      8         1.93     3.74    15   3    -0.88
##   Kurtosis
## 1     2.75
## 2     3.86
## 3     6.47
## 4   150.90
## 5  1194.63
## 6    23.72
## 7     2.92
descriptive_stats_diabetes_num %>%
  kbl(caption = "Descriptive Statistics - Numeric Variables") %>%
  kable_classic()
Descriptive Statistics - Numeric Variables
Variable             Mean  Median  St.Deviation  Variance  Range    IQR  Skewness  Kurtosis
num_lab_procedures  43.10      44         19.67    387.08    131     26     -0.24      2.75
num_procedures       1.34       1          1.71      2.91      6      2      1.32      3.86
num_medications     16.02      15          8.13     66.06     80     10      1.33      6.47
number_outpatient    0.37       0          1.27      1.61     42      0      8.83    150.90
number_emergency     0.20       0          0.93      0.87     76      0     22.86   1194.63
number_inpatient     0.64       0          1.26      1.59     21      1      3.61     23.72
number_diagnoses     7.42       8          1.93      3.74     15      3     -0.88      2.92
set.seed(123)
sample_1_Diabetes.df.num <- Diabetes.df.num[sample(nrow(Diabetes.df.num), 1000), ]
head(sample_1_Diabetes.df.num)
##       num_lab_procedures num_procedures num_medications number_outpatient
## 51663                 61              1               9                 0
## 57870                 64              1              20                 0
## 2986                  68              0               8                 0
## 29925                 46              5               9                 0
## 95246                 49              4              15                 1
## 68293                 47              4              30                 0
##       number_emergency number_inpatient number_diagnoses
## 51663                0                0                9
## 57870                0                1                9
## 2986                 0                0                8
## 29925                0                0                4
## 95246                0                4                9
## 68293                0                0                9
descriptive_stats_sample_1 <- do.call(
  rbind,
  lapply(names(sample_1_Diabetes.df.num), function(col) compute_stats(sample_1_Diabetes.df.num[[col]], col))
)

descriptive_stats_sample_1
##             Variable  Mean Median St.Deviation Variance Range IQR Skewness
## 1 num_lab_procedures 43.20     44        19.87   394.78   104  26    -0.17
## 2     num_procedures  1.34      1         1.74     3.02     6   2     1.33
## 3    num_medications 15.82     15         7.91    62.56    67  10     1.20
## 4  number_outpatient  0.36      0         1.26     1.60    21   0     7.35
## 5   number_emergency  0.23      0         0.94     0.88    16   0     8.80
## 6   number_inpatient  0.70      0         1.31     1.72    15   1     3.59
## 7   number_diagnoses  7.43      8         1.86     3.46     8   3    -0.84
##   Kurtosis
## 1     2.79
## 2     3.82
## 3     5.98
## 4    88.52
## 5   115.68
## 6    25.23
## 7     2.53
descriptive_stats_sample_1 %>%
  kbl(caption = "Descriptive Statistics - Random Sample") %>%
  kable_classic()
Descriptive Statistics - Random Sample
Variable             Mean  Median  St.Deviation  Variance  Range    IQR  Skewness  Kurtosis
num_lab_procedures  43.20      44         19.87    394.78    104     26     -0.17      2.79
num_procedures       1.34       1          1.74      3.02      6      2      1.33      3.82
num_medications     15.82      15          7.91     62.56     67     10      1.20      5.98
number_outpatient    0.36       0          1.26      1.60     21      0      7.35     88.52
number_emergency     0.23       0          0.94      0.88     16      0      8.80    115.68
number_inpatient     0.70       0          1.31      1.72     15      1      3.59     25.23
number_diagnoses     7.43       8          1.86      3.46      8      3     -0.84      2.53
col = "num_lab_procedures"
if (is.numeric(sample_1_Diabetes.df.num[[col]]) || is.integer(sample_1_Diabetes.df.num[[col]])) {
  # Histogram for numeric variables
  ggplot(sample_1_Diabetes.df.num, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkviolet", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Q-Q plot of num_lab_procedures against the normal distribution (random sample)
qqnorm(sample_1_Diabetes.df.num$num_lab_procedures, main = 'Q-Q Plot for Normality', col = "steelblue")
qqline(sample_1_Diabetes.df.num$num_lab_procedures, col = "red", lwd = 2, lty = 2)

col = "num_procedures"
if (is.numeric(sample_1_Diabetes.df.num[[col]]) || is.integer(sample_1_Diabetes.df.num[[col]])) {
  # Histogram for numeric variables
  ggplot(sample_1_Diabetes.df.num, aes(x = .data[[col]])) +
    geom_histogram(bins = 5, fill = "blue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Q-Q plot of num_procedures against the normal distribution (random sample)
qqnorm(sample_1_Diabetes.df.num$num_procedures, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(sample_1_Diabetes.df.num$num_procedures, col = "red", lwd = 2, lty = 2)

col = "num_medications"
if (is.numeric(sample_1_Diabetes.df.num[[col]]) || is.integer(sample_1_Diabetes.df.num[[col]])) {
  # Histogram for numeric variables
  ggplot(sample_1_Diabetes.df.num, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkgreen", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Q-Q plot of num_medications against the normal distribution (random sample)
qqnorm(sample_1_Diabetes.df.num$num_medications, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(sample_1_Diabetes.df.num$num_medications, col = "red", lwd = 2, lty = 2)

set.seed(456)
sample_2_Diabetes.df <- Diabetes.df %>%
  group_by(age) %>%
  sample_n(100)
head(sample_2_Diabetes.df)
## # A tibble: 6 × 50
## # Groups:   age [1]
##   encounter_id patient_nbr race            gender age   weight admission_type_id
##          <dbl>       <dbl> <fct>           <fct>  <chr> <chr>  <fct>            
## 1     25121856    16295499 AfricanAmerican Female [0-1… NA     3                
## 2     18757914     7691157 AfricanAmerican Male   [0-1… NA     3                
## 3     18298074     5236596 Caucasian       Male   [0-1… [0-25) 6                
## 4     81214752    21073491 Caucasian       Male   [0-1… NA     1                
## 5     52344870    18844713 Caucasian       Female [0-1… NA     1                
## 6     59111826     2048598 Caucasian       Male   [0-1… NA     1                
## # ℹ 43 more variables: discharge_disposition_id <fct>,
## #   admission_source_id <fct>, time_in_hospital <dbl>, payer_code <chr>,
## #   medical_specialty <chr>, num_lab_procedures <dbl>, num_procedures <dbl>,
## #   num_medications <dbl>, number_outpatient <dbl>, number_emergency <dbl>,
## #   number_inpatient <dbl>, diag_1 <chr>, diag_2 <chr>, diag_3 <chr>,
## #   number_diagnoses <dbl>, max_glu_serum <fct>, A1Cresult <fct>,
## #   metformin <fct>, repaglinide <fct>, nateglinide <fct>, …
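
Equal allocation across strata can be sanity-checked by counting rows per stratum after sampling. The block below is a base R sketch on a small toy data frame (hypothetical data and names, not the hospital dataset); it draws a fixed number of rows from each age group and compares the age mix before and after sampling.

```r
# Toy population with deliberately unequal age groups (hypothetical data)
toy <- data.frame(
  age = rep(c("[0-10)", "[10-20)", "[20-30)"), times = c(60, 180, 360)),
  value = seq_len(600)
)

set.seed(456)
per_stratum <- 20  # rows to draw from every age group

# Split by stratum, sample the same number of rows from each, and recombine
strata <- split(toy, toy$age)
toy_stratified <- do.call(
  rbind,
  lapply(strata, function(d) d[sample(nrow(d), per_stratum), ])
)

table(toy_stratified$age)              # rows drawn per stratum
prop.table(table(toy$age))             # age mix in the toy population
prop.table(table(toy_stratified$age))  # age mix in the stratified sample
```

With equal allocation every age group contributes the same share of the sample regardless of its share of the population, so unweighted sample statistics can differ from those of the full data.
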
descriptive_stats_sample_2 <- do.call(
  rbind,
  lapply(names(sample_2_Diabetes.df), function(col) compute_stats(sample_2_Diabetes.df[[col]], col))
)

descriptive_stats_sample_2
##              Variable        Mean    Median St.Deviation     Variance     Range
## 1        encounter_id 1.54239e+08 145628052 102684709.86 1.054415e+16 441244556
## 2         patient_nbr 5.05900e+07  42175899  39314511.59 1.545631e+15 188223845
## 3    time_in_hospital 4.03000e+00         3         2.80 7.830000e+00        13
## 4  num_lab_procedures 4.31800e+01        45        19.08 3.640900e+02        96
## 5      num_procedures 1.01000e+00         0         1.52 2.320000e+00         6
## 6     num_medications 1.38700e+01        12         8.49 7.200000e+01        74
## 7   number_outpatient 2.70000e-01         0         0.96 9.200000e-01        15
## 8    number_emergency 2.20000e-01         0         0.86 7.500000e-01        13
## 9    number_inpatient 6.70000e-01         0         1.64 2.690000e+00        13
## 10   number_diagnoses 6.39000e+00         7         2.60 6.740000e+00        15
##             IQR Skewness Kurtosis
## 1  138425463.00     0.83     3.22
## 2   66921156.00     0.62     2.77
## 3          3.00     1.29     4.35
## 4         25.00    -0.25     2.76
## 5          1.00     1.73     5.39
## 6         10.00     1.78     9.96
## 7          0.00     6.66    73.29
## 8          0.00     7.01    72.49
## 9          1.00     4.21    25.39
## 10         4.25    -0.48     2.12
descriptive_stats_sample_2 %>%
  kbl(caption = "Descriptive Statistics - Stratified Sample based on Age") %>%
  kable_classic()
Descriptive Statistics - Stratified Sample based on Age
Variable                   Mean     Median  St.Deviation      Variance      Range           IQR  Skewness  Kurtosis
encounter_id        1.54239e+08  145628052  102684709.86  1.054415e+16  441244556  138425463.00      0.83      3.22
patient_nbr         5.05900e+07   42175899   39314511.59  1.545631e+15  188223845   66921156.00      0.62      2.77
time_in_hospital           4.03          3          2.80          7.83         13          3.00      1.29      4.35
num_lab_procedures        43.18         45         19.08        364.09         96         25.00     -0.25      2.76
num_procedures             1.01          0          1.52          2.32          6          1.00      1.73      5.39
num_medications           13.87         12          8.49         72.00         74         10.00      1.78      9.96
number_outpatient          0.27          0          0.96          0.92         15          0.00      6.66     73.29
number_emergency           0.22          0          0.86          0.75         13          0.00      7.01     72.49
number_inpatient           0.67          0          1.64          2.69         13          1.00      4.21     25.39
number_diagnoses           6.39          7          2.60          6.74         15          4.25     -0.48      2.12
sample_2_num <- sample_2_Diabetes.df[c(13:18,22)]
head(sample_2_num)
## # A tibble: 6 × 7
##   num_lab_procedures num_procedures num_medications number_outpatient
##                <dbl>          <dbl>           <dbl>             <dbl>
## 1                 21              0               4                 0
## 2                 39              3              19                 0
## 3                 37              0               5                 0
## 4                 35              0               4                 0
## 5                 42              0               5                 0
## 6                 42              0              10                 0
## # ℹ 3 more variables: number_emergency <dbl>, number_inpatient <dbl>,
## #   number_diagnoses <dbl>
descriptive_stats_sample_2.2 <- do.call(
  rbind,
  lapply(names(sample_2_num), function(col) compute_stats(sample_2_num[[col]], col))
)

descriptive_stats_sample_2.2
##             Variable  Mean Median St.Deviation Variance Range   IQR Skewness
## 1 num_lab_procedures 43.18     45        19.08   364.09    96 25.00    -0.25
## 2     num_procedures  1.01      0         1.52     2.32     6  1.00     1.73
## 3    num_medications 13.87     12         8.49    72.00    74 10.00     1.78
## 4  number_outpatient  0.27      0         0.96     0.92    15  0.00     6.66
## 5   number_emergency  0.22      0         0.86     0.75    13  0.00     7.01
## 6   number_inpatient  0.67      0         1.64     2.69    13  1.00     4.21
## 7   number_diagnoses  6.39      7         2.60     6.74    15  4.25    -0.48
##   Kurtosis
## 1     2.76
## 2     5.39
## 3     9.96
## 4    73.29
## 5    72.49
## 6    25.39
## 7     2.12
descriptive_stats_sample_2.2 %>%
  kbl(caption = "Descriptive Statistics - Stratified Sample based on Age") %>%
  kable_classic()
Descriptive Statistics - Stratified Sample based on Age
Variable             Mean  Median  St.Deviation  Variance  Range    IQR  Skewness  Kurtosis
num_lab_procedures  43.18      45         19.08    364.09     96  25.00     -0.25      2.76
num_procedures       1.01       0          1.52      2.32      6   1.00      1.73      5.39
num_medications     13.87      12          8.49     72.00     74  10.00      1.78      9.96
number_outpatient    0.27       0          0.96      0.92     15   0.00      6.66     73.29
number_emergency     0.22       0          0.86      0.75     13   0.00      7.01     72.49
number_inpatient     0.67       0          1.64      2.69     13   1.00      4.21     25.39
number_diagnoses     6.39       7          2.60      6.74     15   4.25     -0.48      2.12
col = "num_lab_procedures"
if (is.numeric(sample_2_Diabetes.df[[col]]) || is.integer(sample_2_Diabetes.df[[col]])) {
  # Histogram for numeric variables
  ggplot(sample_2_Diabetes.df, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkviolet", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Q-Q plot of num_lab_procedures against the normal distribution (stratified sample)
qqnorm(sample_2_Diabetes.df$num_lab_procedures, main = 'Q-Q Plot for Normality', col = "steelblue")
qqline(sample_2_Diabetes.df$num_lab_procedures, col = "red", lwd = 2, lty = 2)

col = "num_procedures"
if (is.numeric(sample_2_Diabetes.df[[col]]) || is.integer(sample_2_Diabetes.df[[col]])) {
  # Histogram for numeric variables
  ggplot(sample_2_Diabetes.df, aes(x = .data[[col]])) +
    geom_histogram(bins = 5, fill = "blue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Q-Q plot of num_procedures against the normal distribution (stratified sample)
qqnorm(sample_2_Diabetes.df$num_procedures, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(sample_2_Diabetes.df$num_procedures, col = "red", lwd = 2, lty = 2)

col = "num_medications"
if (is.numeric(sample_2_Diabetes.df[[col]]) || is.integer(sample_2_Diabetes.df[[col]])) {
  # Histogram for numeric variables
  ggplot(sample_2_Diabetes.df, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkgreen", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Q-Q plot of num_medications against the normal distribution (stratified sample)
qqnorm(sample_2_Diabetes.df$num_medications, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(sample_2_Diabetes.df$num_medications, col = "red", lwd = 2, lty = 2)

N <- nrow(Diabetes.df.num)
n <- 1000
k <- floor(N/n)

set.seed(123)
start <- sample(1:k, 1)

systematic_indices_diabetes <- seq(from = start, to = N, by = k)
systematic_sample_3_diabetes <- Diabetes.df.num[systematic_indices_diabetes, ]
head(systematic_sample_3_diabetes)
##     num_lab_procedures num_procedures num_medications number_outpatient
## 31                  87              0              18                 0
## 132                 34              1              19                 0
## 233                 48              1              10                 0
## 334                  3              0              13                 0
## 435                 41              0               9                 0
## 536                 32              0               6                 0
##     number_emergency number_inpatient number_diagnoses
## 31                 0                0                9
## 132                0                0                4
## 233                0                1                5
## 334                0                0                9
## 435                0                0                5
## 536                0                0                4
descriptive_stats_sample_3 <- do.call(
  rbind,
  lapply(names(systematic_sample_3_diabetes), function(col) compute_stats(systematic_sample_3_diabetes[[col]], col))
)

descriptive_stats_sample_3
##             Variable  Mean Median St.Deviation Variance Range IQR Skewness
## 1 num_lab_procedures 43.29   44.5        19.52   381.08   101  25    -0.28
## 2     num_procedures  1.36    1.0         1.73     3.00     6   2     1.26
## 3    num_medications 16.33   15.0         8.20    67.22    67  10     1.45
## 4  number_outpatient  0.44    0.0         1.26     1.60    11   0     4.28
## 5   number_emergency  0.20    0.0         0.70     0.50     8   0     5.32
## 6   number_inpatient  0.69    0.0         1.46     2.14    15   1     3.88
## 7   number_diagnoses  7.41    8.0         1.95     3.80     8   3    -0.95
##   Kurtosis
## 1     2.76
## 2     3.60
## 3     7.96
## 4    25.16
## 5    39.73
## 6    25.08
## 7     2.83
descriptive_stats_sample_3 %>%
  kbl(caption = "Descriptive Statistics - Systematic Sample") %>%
  kable_classic()
Table: Descriptive Statistics - Systematic Sample

Variable            Mean   Median  St.Deviation  Variance  Range  IQR  Skewness  Kurtosis
num_lab_procedures  43.29  44.5    19.52         381.08    101    25   -0.28     2.76
num_procedures      1.36   1.0     1.73          3.00      6      2    1.26      3.60
num_medications     16.33  15.0    8.20          67.22     67     10   1.45      7.96
number_outpatient   0.44   0.0     1.26          1.60      11     0    4.28      25.16
number_emergency    0.20   0.0     0.70          0.50      8      0    5.32      39.73
number_inpatient    0.69   0.0     1.46          2.14      15     1    3.88      25.08
number_diagnoses    7.41   8.0     1.95          3.80      8      3    -0.95     2.83
col <- "num_lab_procedures"
# is.numeric() is also TRUE for integer columns, so a single check is enough
if (is.numeric(systematic_sample_3_diabetes[[col]])) {
  # Histogram for numeric variables
  ggplot(systematic_sample_3_diabetes, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkviolet", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Q-Q plot of the sampled values against a normal distribution
# (the unused rnorm() draw was removed; the plot uses only the sample)
qqnorm(systematic_sample_3_diabetes$num_lab_procedures, main = 'Q-Q Plot for Normality', col = "steelblue")
qqline(systematic_sample_3_diabetes$num_lab_procedures, col = "red", lwd = 2, lty = 2)

col <- "num_procedures"
# is.numeric() is also TRUE for integer columns, so a single check is enough
if (is.numeric(systematic_sample_3_diabetes[[col]])) {
  # Histogram for numeric variables
  ggplot(systematic_sample_3_diabetes, aes(x = .data[[col]])) +
    geom_histogram(bins = 5, fill = "blue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Q-Q plot of the sampled values against a normal distribution
# (the unused rnorm() draw was removed; the plot uses only the sample)
qqnorm(systematic_sample_3_diabetes$num_procedures, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(systematic_sample_3_diabetes$num_procedures, col = "red", lwd = 2, lty = 2)

col <- "num_medications"
# is.numeric() is also TRUE for integer columns, so a single check is enough
if (is.numeric(systematic_sample_3_diabetes[[col]])) {
  # Histogram for numeric variables
  ggplot(systematic_sample_3_diabetes, aes(x = .data[[col]])) +
    geom_histogram(bins = 15, fill = "darkgreen", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

# Q-Q plot of the sampled values against a normal distribution
# (the unused rnorm() draw was removed; the plot uses only the sample)
qqnorm(systematic_sample_3_diabetes$num_medications, main = 'Q-Q Plot for Right-Skewed', col = "steelblue")
qqline(systematic_sample_3_diabetes$num_medications, col = "red", lwd = 2, lty = 2)

Measures of central tendency (mean and median) were comparable across the three sampling methods for all numeric variables. Skewness and kurtosis, however, were not comparable across methods for the number of outpatient visits and the number of emergency visits, most likely because of outliers in both variables. For the number of outpatient visits, the mean is 0.3694 and the third quartile is 0, yet the maximum is 42, which indicates at least one outlier and possibly several. Likewise, the mean for the number of emergency visits is 0.1978 and the third quartile is 0, yet the maximum is 76. Because skewness and kurtosis depend on the third and fourth powers of deviations from the mean, they change substantially depending on whether a sample happens to include these few extreme values. Further investigation of the outliers is needed to determine whether the sample statistics describe the typical patient better than full-dataset statistics that include the outliers.
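The sensitivity of skewness and kurtosis to a few extreme values can be illustrated with a minimal sketch. It uses synthetic, zero-heavy counts rather than the hospital data, and the helper functions `skewness_m()` and `kurtosis_m()` are hypothetical moment-based definitions written for this illustration; they are not necessarily the formulas used by `compute_stats()`.

```r
# Synthetic stand-in for a zero-heavy visit count; NOT the hospital data.
set.seed(42)
visits <- rpois(10000, lambda = 0.3)
visits[1:5] <- c(40, 42, 55, 60, 76)  # inject a handful of extreme values

# Moment-based skewness and kurtosis (assumed formulas, illustration only).
skewness_m <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis_m <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Keep values of 10 or less, which here removes the injected extremes.
trimmed <- visits[visits <= 10]

shape_check <- data.frame(
  data     = c("with extremes", "without extremes"),
  mean     = c(mean(visits), mean(trimmed)),
  median   = c(median(visits), median(trimmed)),
  skewness = c(skewness_m(visits), skewness_m(trimmed)),
  kurtosis = c(kurtosis_m(visits), kurtosis_m(trimmed))
)
print(shape_check)
```

In this sketch the mean and median barely move when the five extreme values are removed, while skewness and kurtosis drop sharply. A sample of 1,000 rows that misses the extreme values would be expected to behave like the trimmed data.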

Histograms for the full dataset and the three samples have similar shapes. For example, the number of lab procedures is approximately normally distributed with a heavy left tail in the full dataset and in each sample. This indicates that the three sampling methods provided reasonably similar representations of the data.

A comparison of the quantile-quantile (Q-Q) plots for the full dataset and for each sample also indicates similar distributions. The heavy left tail is evident in the Q-Q plot for each of the samples.
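A visual Q-Q comparison can be supplemented with a numeric check of selected quantiles. The following is a minimal sketch on synthetic data, not the hospital data; the names `population`, `sample_x`, and `comparison`, and the chosen probabilities, are assumptions made for this illustration.

```r
# Synthetic population standing in for a numeric variable; NOT the hospital data.
set.seed(1)
population <- rbinom(100000, size = 100, prob = 0.43)

# Simple random sample of 1,000 observations without replacement.
sample_x <- sample(population, size = 1000)

# Compare the same quantiles in the full data and in the sample.
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
comparison <- data.frame(
  prob   = probs,
  full   = as.numeric(quantile(population, probs)),
  sample = as.numeric(quantile(sample_x, probs))
)
comparison$diff <- comparison$sample - comparison$full
print(comparison)

# Optional two-sample Q-Q plot of the same comparison:
# qqplot(population, sample_x); abline(0, 1, col = "red", lty = 2)
```

Small values in the `diff` column across the range of probabilities suggest that the sample follows the distribution of the full data, which is the same information a two-sample Q-Q plot conveys graphically.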

Using three sampling methods (random, stratified, and systematic sampling) improved the efficiency of data analysis with the 130 Hospitals Diabetes dataset without sacrificing reliability. Similar value distributions were observed across the histograms and Q-Q plots. This could be partially due to the inherent feature selection that resulted from eliminating categorical variables and analyzing only numeric variables, which reduced the number of features from 50 to 7. Additionally, the sampling methods reduced the number of observations analyzed from over 100,000 to 1,000 while yielding comparable statistics. Sampling reduces the number of observations rather than the number of features, so strategic sampling is a valuable complement to feature selection for improving model efficiency with large, high-dimensional datasets.
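The reduction from a large number of observations to a sample of 1,000 can be sketched for the systematic case as follows. This is an illustration under stated assumptions: the population is synthetic, the population size `N` is a round number chosen for the example, and the interval rule `k <- floor(N / n)` with a random start is one common convention that may differ from the exact values used in this analysis.

```r
# Synthetic population; N is an assumed size, NOT the real row count.
set.seed(123)
N <- 100000
n <- 1000
population <- data.frame(x = rpois(N, lambda = 16))

# Systematic sampling: every k-th row after a random start in 1..k.
k <- floor(N / n)
start <- sample(seq_len(k), 1)
idx <- seq(from = start, to = N, by = k)
systematic <- population[idx, , drop = FALSE]

# Compare a summary statistic between the full data and the sample.
means <- c(full = mean(population$x), sample = mean(systematic$x))
print(nrow(systematic))
print(means)
```

When `N` is not an exact multiple of `n`, this rule can return slightly more than `n` rows, so the resulting sample size is worth checking before reporting it.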

Healthcare datasets provide opportunities to develop machine learning models for disease or admission prediction; however, high dimensionality creates a challenge for model development and performance. Sampling methods and feature selection algorithms both reduce the computational burden of such datasets: feature selection addresses the curse of dimensionality by reducing the number of features, while sampling reduces the number of observations. The 130 Hospitals Diabetes dataset was used in this project to demonstrate that sampling methods can reduce data volume while preserving comparable descriptive statistics. Sampling methods are valuable options when developing models from large, high-dimensional datasets.

References

World Health Organization. (2024, November 14). Diabetes. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/diabetes