Part I: Research Question
Which variables are the most important in regards to predicting which patients are at a high risk of readmission?
Variables:
library(readr)
df_raw <- read.csv('~/Documents/WGU MSDA/D206/medical_raw_data.csv')
str(df_raw)
'data.frame': 10000 obs. of 53 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ CaseOrder : int 1 2 3 4 5 6 7 8 9 10 ...
$ Customer_id : chr "C412403" "Z919181" "F995323" "A879973" ...
$ Interaction : chr "8cd49b13-f45a-4b47-a2bd-173ffa932c2f" "d2450b70-0337-4406-bdbb-bc1037f1734c" "a2057123-abf5-4a2c-abad-8ffe33512562" "1dec528d-eb34-4079-adce-0d7a40e82205" ...
$ UID : chr "3a83ddb66e2ae73798bdf1d705dc0932" "176354c5eef714957d486009feabf195" "e19a0fa00aeda885b8a436757e889bc9" "cd17d7b6d152cb6f23957346d11c3f07" ...
$ City : chr "Eva" "Marianna" "Sioux Falls" "New Richland" ...
$ State : chr "AL" "FL" "SD" "MN" ...
$ County : chr "Morgan" "Jackson" "Minnehaha" "Waseca" ...
$ Zip : int 35621 32446 57110 56072 23181 74423 44086 22641 32404 56362 ...
$ Lat : num 34.3 30.8 43.5 43.9 37.6 ...
$ Lng : num -86.7 -85.2 -96.6 -93.5 -76.9 ...
$ Population : int 2951 11303 17125 2162 5287 981 2558 479 40029 5840 ...
$ Area : chr "Suburban" "Urban" "Suburban" "Suburban" ...
$ Timezone : chr "America/Chicago" "America/Chicago" "America/Chicago" "America/Chicago" ...
$ Job : chr "Psychologist, sport and exercise" "Community development worker" "Chief Executive Officer" "Early years teacher" ...
$ Children : int 1 3 3 0 NA NA 0 7 NA 2 ...
$ Age : int 53 51 53 78 22 76 50 40 48 78 ...
$ Education : chr "Some College, Less than 1 Year" "Some College, 1 or More Years, No Degree" "Some College, 1 or More Years, No Degree" "GED or Alternative Credential" ...
$ Employment : chr "Full Time" "Full Time" "Retired" "Retired" ...
$ Income : num 86576 46806 14370 39741 1210 ...
$ Marital : chr "Divorced" "Married" "Widowed" "Married" ...
$ Gender : chr "Male" "Female" "Female" "Male" ...
$ ReAdmis : chr "No" "No" "No" "No" ...
$ VitD_levels : num 17.8 19 17.4 17.4 16.9 ...
$ Doc_visits : int 6 4 4 4 5 6 6 7 6 7 ...
$ Full_meals_eaten : int 0 2 1 1 0 0 0 2 3 1 ...
$ VitD_supp : int 0 1 0 0 2 0 0 0 0 2 ...
$ Soft_drink : chr NA "No" "No" "No" ...
$ Initial_admin : chr "Emergency Admission" "Emergency Admission" "Elective Admission" "Elective Admission" ...
$ HighBlood : chr "Yes" "Yes" "Yes" "No" ...
$ Stroke : chr "No" "No" "No" "Yes" ...
$ Complication_risk : chr "Medium" "High" "Medium" "Medium" ...
$ Overweight : int 0 1 1 0 0 1 1 1 1 1 ...
$ Arthritis : chr "Yes" "No" "No" "Yes" ...
$ Diabetes : chr "Yes" "No" "Yes" "No" ...
$ Hyperlipidemia : chr "No" "No" "No" "No" ...
$ BackPain : chr "Yes" "No" "No" "No" ...
$ Anxiety : int 1 NA NA NA 0 0 1 0 NA 0 ...
$ Allergic_rhinitis : chr "Yes" "No" "No" "No" ...
$ Reflux_esophagitis: chr "No" "Yes" "No" "Yes" ...
$ Asthma : chr "Yes" "No" "No" "Yes" ...
$ Services : chr "Blood Work" "Intravenous" "Blood Work" "Blood Work" ...
$ Initial_days : num 10.59 15.13 4.77 1.71 1.25 ...
$ TotalCharge : num 3191 4215 2178 2465 1886 ...
$ Additional_charges: num 17939 17613 17505 12993 3717 ...
$ Item1 : int 3 3 2 3 2 4 4 1 3 5 ...
$ Item2 : int 3 4 4 5 1 5 3 2 3 5 ...
$ Item3 : int 2 3 4 5 3 4 3 2 2 5 ...
$ Item4 : int 2 4 4 3 3 4 2 5 3 3 ...
$ Item5 : int 4 4 3 4 5 3 3 4 3 4 ...
$ Item6 : int 3 4 4 5 3 5 4 2 3 2 ...
$ Item7 : int 3 3 3 5 4 4 5 4 4 3 ...
$ Item8 : int 4 3 3 5 3 6 5 2 2 2 ...
Unnamed: integer index
CaseOrder: integer index to preserve original order of raw data
Customer_id: character string unique to patient
Interaction: character string unique to patient transactions, procedures and admissions
UID: character string unique to patient transactions, procedures and admissions
City: character string indicating patient’s city of residence as listed on the billing statement
State: character string indicating patient’s state of residence as listed on the billing statement
County: character string indicating patient’s county of residence as listed on the billing statement
Zip: integer indicating patient’s zip code of residence as listed on the billing statement
Lat: continuous numeric GPS coordinates indicating latitude of patient’s residence as listed on the billing statement
Lng: continuous numeric GPS coordinates indicating longitude of patient’s residence as listed on the billing statement
Population: integer value indicating population within a mile radius of patient- based on census data
Area: nominal categorical - character string indicating area type- based on census data
unique(df_raw$Area)
[1] "Suburban" "Urban" "Rural"
Timezone: nominal categorical - character string indicating timezone of patient’s residence as provided by patient
unique(df_raw$Timezone)
[1] "America/Chicago" "America/New_York" "America/Los_Angeles"
[4] "America/Indiana/Indianapolis" "America/Detroit" "America/Denver"
[7] "America/Nome" "America/Anchorage" "America/Phoenix"
[10] "America/Boise" "America/Puerto_Rico" "America/Yakutat"
[13] "Pacific/Honolulu" "America/Menominee" "America/Kentucky/Louisville"
[16] "America/Indiana/Vincennes" "America/Toronto" "America/Indiana/Marengo"
[19] "America/Indiana/Winamac" "America/Indiana/Tell_City" "America/Sitka"
[22] "America/Indiana/Knox" "America/North_Dakota/New_Salem" "America/Indiana/Vevay"
[25] "America/Adak" "America/North_Dakota/Beulah"
Job: nominal categorical - character string indicating patient’s (or primary insurance holder’s) job as provided by patient
Children: integer indicating number of children in patient’s household as provided by patient
Age: integer indicating patient’s age as provided by patient
Education: nominal categorical - character string indicating patient’s highest earned degree as provided by patient
unique(df_raw$Education)
[1] "Some College, Less than 1 Year" "Some College, 1 or More Years, No Degree"
[3] "GED or Alternative Credential" "Regular High School Diploma"
[5] "Bachelor's Degree" "Master's Degree"
[7] "Nursery School to 8th Grade" "9th Grade to 12th Grade, No Diploma"
[9] "Doctorate Degree" "Associate's Degree"
[11] "Professional School Degree" "No Schooling Completed"
Employment: categorical - character string indicating patient’s employment status as provided by patient
unique(df_raw$Employment)
[1] "Full Time" "Retired" "Unemployed" "Student" "Part Time"
Income: - numeric value indicating annual income of patient (or primary insurance holder) as provided by patient
Marital: nominal categorical - character string indicating patient’s (or primary insurance holder’s) marital status as provided by patient
unique(df_raw$Marital)
[1] "Divorced" "Married" "Widowed" "Never Married" "Separated"
Gender: nominal categorical - character string indicating patient’s gender as provided by patient
unique(df_raw$Gender)
[1] "Male" "Female" "Prefer not to answer"
ReAdmis: binary categorical - character string indicating whether or not patient was readmitted within a month of release [Yes, No] *target variable
VitD_levels: continuous numeric value indicating patient’s vitamin D levels as measured in ng/mL
Doc_visits: integer indicating number of times the primary physician visited the patient during the initial hospitalization
Full_meals_eaten: integer indicating number of full meals eaten (partial meals count as 0) VitD_supp: integer indicating number of times that vitamin D supplements were administered to patient
Soft_drink: binary categorical - character string indicating whether or not patient regularly drinks three or more sodas in a day [Yes, No]
Initial_admin: nominal categorical - character string indicating the means by which the patient was initially admitted into the hospital
unique(df_raw$Initial_admin)
[1] "Emergency Admission" "Elective Admission" "Observation Admission"
HighBlood: binary categorical - character string indicating whether or not the patient has high blood pressure [Yes, No]
Stroke: binary categorical - character string indicating whether or not the patient has had a stroke [Yes, No]
Complication_risk: ordinal categorical - character string indicating level of complication risk [High, Medium, Low]
Overweight: binary categorical - integer indicating whether (1) or not (0) the patient is overweight, as determined by age, gender, and height
Arthritis: binary categorical - character string indicating whether or not the patient has arthritis [Yes, No]
Diabetes: binary categorical - character string indicating whether or not the patient has diabetes [Yes, No]
Hyperlipidemia: binary categorical - character string indicating whether or not the patient has hyperlipidemia [Yes, No]
BackPain: binary categorical - character string indicating whether or not the patient has chronic backpain [Yes, No]
Anxiety: binary categorical - integer indicating whether (1) or not (0) the patient has an anxiety disorder
Allergic_rhinitis: binary categorical - character string indicating whether or not the patient has allergic rhinitis [Yes, No]
Reflux_esophagitis: binary categorical - character string indicating whether or not the patient has reflux esophagitis [Yes, No]
Asthma: binary categorical - character string indicating whether or not the patient has asthma [Yes, No]
Services: nominal categorical - character string indicating the primary service the patient received while hospitalized
unique(df_raw$Services)
[1] "Blood Work" "Intravenous" "CT Scan" "MRI"
Initial_days: numeric value indicating number of days the patient stayed in the hospital during the initial visit
TotalCharge: numeric value indicating patient’s average daily charges for typical (not specialized) treatments and services during the initial visit
Additional_charges: numeric value indicating patient’s average daily charges for additional treatments and services during the initial visit
Item1: integer value indicating the level of importance of timely admission from most important (1) to least important (8), as reported by patient
Item2: integer value indicating the level of importance of timely treatment from most important (1) to least important (8), as reported by patient
Item3: integer value indicating the level of importance of timely visits from most important (1) to least important (8), as reported by patient
Item4: integer value indicating the level of importance of reliability from most important (1) to least important (8), as reported by patient
Item5: integer value indicating the level of importance of options from most important (1) to least important (8), as reported by patient
Item6: integer value indicating the level of importance of hours of treatment from most important (1) to least important (8), as reported by patient
Item7: integer value indicating the level of importance of courteous staff from most important (1) to least important (8), as reported by patient
Item8: integer value indicating the level of importance of evidence of active listeing from doctor from most important (1) to least important (8), as reported by patient
Data-Cleaning Plan
In order to asses the quality of- and clean- the data, I will be using R. R was created by statisticians, making it ideal for statistical analysis. R also allows for interpretable visualizations, and packages built for cleaning data and identifying outliers. Some of these packages include readr for reading the raw data file into R, caret for one hot encoding using dummy variables, and dplyr and ggplot2, for various calculations and visualizations.
- Reset index
- Removal of redundant, irrelevant, or misleading variables
-‘X’, ‘Customer_id’, ‘Interaction_id’, ‘Job’, ‘Income’, ‘Marital’
- Data was collected on the job and marital status of either the patient or the primary insurance holder. The inconsistency of this data may lead to false conclusions, and will therefore be removed.
- Renaming misleading variables -Data was collected on the income of either the patient or the primary insurance holder, so the field name will be renamed to reflect this. - Re-expression of categorical data as numeric data
- Change character values to numeric values or separate into separate variables using dummy variables
(ref: https://stackoverflow.com/questions/54602192/make-only-some-features-dummyvars) - Imputation of NULL values
- Change NULL values of patient-provided data to reflect a “0” response - MICE imputation for all other NULL values
(ref: https://www.rdocumentation.org/packages/mice/versions/2.25/topics/mice) - Detection of outliers
- View summary of univariate statistics, search for flags
- Visualize potential outliers with boxplots
- Run hypothesis test on potential outliers (grubbs test)
- Standardize variables, when necessary
- No data will be deleted or changed unless it was an obvious mistake. Changing available data will skew models toward a conclusion that may not generalize if data has been overly-manipulated.
Data Cleaning
head(df_raw)
Remove irrelevant columns
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
df <- df_raw[,c(6:53)]
df <- select(df, c(-Job, -Marital))
Rename misleading variable names
names(df)[names(df) == 'Income'] <- 'Income_household'
names(df)[names(df) == 'Item1'] <- 'Survey_TimelyAdmin'
names(df)[names(df) == 'Item2'] <- 'Survey_TimelyTreatment'
names(df)[names(df) == 'Item3'] <- 'Survey_TimelyVisits'
names(df)[names(df) == 'Item4'] <- 'Survey_Reliability'
names(df)[names(df) == 'Item5'] <- 'Survey_Options'
names(df)[names(df) == 'Item6'] <- 'Survey_HoursTreatment'
names(df)[names(df) == 'Item7'] <- 'Survey_CourteousStaff'
names(df)[names(df) == 'Item8'] <- 'Survey_ActiveListening'
head(df)
Set index
num_rows <- dim(df)[1]
row.names(df) <- c(1:num_rows)
head(df)
Re-expression of categorical data as numeric data
State
x <- df[order(df$State),"State"]
unique(x)
[1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD"
[22] "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "PR" "RI" "SC"
[43] "SD" "TN" "TX" "UT" "VA" "VT" "WA" "WI" "WV" "WY"
library(plyr)
-------------------------------------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
-------------------------------------------------------------------------------------------------------------
Attaching package: ‘plyr’
The following objects are masked from ‘package:dplyr’:
arrange, count, desc, failwith, id, mutate, rename, summarise, summarize
data <- df$State
state_dict <- c(
"AL" = 1, "AK" = 2, "AZ" = 3, "AR" = 4, "CA" = 5, "CO" = 6, "CT" = 7, "DE" = 8, "DC" = 9, "FL" = 10,
"GA" = 11, "HI" = 12, "ID" = 13, "IL" = 14, "IN" = 15, "IA" = 16, "KS" = 17, "KY" = 18, "LA" = 19, "ME" = 20,
"MD" = 21, "MA" = 22, "MI" = 23, "MN" = 24, "MS" = 25, "MO" = 26, "MT" = 27, "NE" = 28, "NV" = 29, "NH" = 30,
"NJ" = 31, "NM" = 32, "NY" = 33, "NC" = 34, "ND" = 35, "OH" = 36, "OK" = 37, "OR" = 38, "PA" = 39, "PR" = 40,
"RI" = 41, "SC" = 42, "SD" = 43, "TN" = 44, "TX" = 45, "UT" = 46, "VT" = 47, "VA" = 48, "WA" = 49, "WV" = 50,
"WI" = 51, "WY" = 52)
state_val <- revalue(x= data, replace = state_dict)
df$State <- as.numeric(state_val)
Area
unique(df$Area)
[1] "Suburban" "Urban" "Rural"
data <- df$Area
my_dict <- c(
"Rural" = 1,
"Suburban" = 2,
"Urban" = 3)
area_val <- revalue(x= data, replace = my_dict)
df$Area <- as.numeric(area_val)
Timezone
unique(df$Timezone)
[1] "America/Chicago" "America/New_York" "America/Los_Angeles"
[4] "America/Indiana/Indianapolis" "America/Detroit" "America/Denver"
[7] "America/Nome" "America/Anchorage" "America/Phoenix"
[10] "America/Boise" "America/Puerto_Rico" "America/Yakutat"
[13] "Pacific/Honolulu" "America/Menominee" "America/Kentucky/Louisville"
[16] "America/Indiana/Vincennes" "America/Toronto" "America/Indiana/Marengo"
[19] "America/Indiana/Winamac" "America/Indiana/Tell_City" "America/Sitka"
[22] "America/Indiana/Knox" "America/North_Dakota/New_Salem" "America/Indiana/Vevay"
[25] "America/Adak" "America/North_Dakota/Beulah"
data <- df$Timezone
my_dict <- c(
"America/Puerto_Rico" = -4,
"America/Detroit" = -5,
"America/Indiana/Indianapolis" = -5,
"America/Indiana/Marengo" = -5,
"America/Indiana/Vincennes" = -5,
"America/Indiana/Vevay" = -5,
"America/Indiana/Winamac" = -5,
"America/Kentucky/Louisville" = -5,
"America/New_York" = -5,
"America/Toronto" = -5,
"America/Chicago" = -6,
"America/Indiana/Knox" = -6,
"America/Indiana/Tell_City" = -6,
"America/Menominee" = -6,
"America/North_Dakota/Beulah" = -6,
"America/North_Dakota/New_Salem" = -6,
"America/Boise" = -7,
"America/Denver" = -7,
"America/Phoenix" = -7,
"America/Los_Angeles" = -8,
"America/Anchorage" = -9,
"America/Nome" = -9,
"America/Sitka" = -9,
"America/Yakutat" = -9,
"America/Adak" = -10,
"Pacific/Honolulu" = -10)
timezone_val <- revalue(x= data, replace = my_dict)
df$Timezone <- as.numeric(timezone_val)
Education
unique(df$Education)
[1] "Some College, Less than 1 Year" "Some College, 1 or More Years, No Degree"
[3] "GED or Alternative Credential" "Regular High School Diploma"
[5] "Bachelor's Degree" "Master's Degree"
[7] "Nursery School to 8th Grade" "9th Grade to 12th Grade, No Diploma"
[9] "Doctorate Degree" "Associate's Degree"
[11] "Professional School Degree" "No Schooling Completed"
data <- df$Education
my_dict <- c(
"No Schooling Completed" = 0,
"Nursery School to 8th Grade" = 8,
"9th Grade to 12th Grade, No Diploma" = 12,
"GED or Alternative Credential" = 12,
"Regular High School Diploma" = 12,
"Some College, Less than 1 Year" = 13,
"Some College, 1 or More Years, No Degree" = 14,
"Associate's Degree" = 15,
"Bachelor's Degree" = 16,
"Master's Degree" = 18,
"Professional School Degree" = 20,
"Doctorate Degree" = 24
)
education_val <- revalue(x= data, replace = my_dict)
df$Education <- as.numeric(education_val)
Employment
unique(df$Employment)
[1] "Full Time" "Retired" "Unemployed" "Student" "Part Time"
library(caret)
Loading required package: lattice
Loading required package: ggplot2
dmy <- dummyVars(" ~ Employment", data = df)
oh_encoded <- data.frame(predict(dmy, newdata = df))
df$Employment_FullTime <- oh_encoded$EmploymentFull.Time
df$Employment_PartTime <- oh_encoded$EmploymentPart.Time
df$Employment_Retired <- oh_encoded$EmploymentRetired
df$Student <- oh_encoded$EmploymentStudent
df$Unemployed <- oh_encoded$EmploymentUnemployed
df <- select(df, -Employment)
Gender
unique(df$Gender)
[1] "Male" "Female" "Prefer not to answer"
dmy <- dummyVars(" ~ Gender", data = df)
oh_encoded <- data.frame(predict(dmy, newdata = df))
df$Female <- oh_encoded$GenderFemale
df$Male <- oh_encoded$GenderMale
df <- select(df, -Gender)
Readmission
unique(df$ReAdmis)
[1] "No" "Yes"
data <- df$ReAdmis
bi_dict <- c(
"No" = 0,
"Yes" = 1
)
binary_val <- revalue(x= data, replace = bi_dict)
df$ReAdmis <- as.numeric(binary_val)
Soft Drink
unique(df$Soft_drink)
[1] NA "No" "Yes"
data <- df$Soft_drink
binary_val <- revalue(x= data, replace = bi_dict)
df$Soft_drink <- as.numeric(binary_val)
Initial Admission
unique(df$Initial_admin)
[1] "Emergency Admission" "Elective Admission" "Observation Admission"
dmy <- dummyVars(" ~ Initial_admin", data = df)
oh_encoded <- data.frame(predict(dmy, newdata = df))
df$Admin_elective <- oh_encoded$Initial_adminElective.Admission
df$Admin_observation <- oh_encoded$Initial_adminObservation.Admission
df$Admin_emergency <- oh_encoded$Initial_adminEmergency.Admission
df <- select(df, -Initial_admin)
High blood pressure
data <- df$HighBlood
binary_val <- revalue(x= data, replace = bi_dict)
df$HighBlood <- as.numeric(binary_val)
Stroke
data <- df$Stroke
binary_val <- revalue(x= data, replace = bi_dict)
df$Stroke <- as.numeric(binary_val)
Complication Risk
unique(df$Complication_risk)
[1] "Medium" "High" "Low"
data <- df$Complication_risk
my_dict <- c(
"Low" = 1,
"Medium" = 2,
"High" = 3)
risk_val <- revalue(x= data, replace = my_dict)
df$Complication_risk <- as.numeric(risk_val)
Arthritis
data <- df$Arthritis
binary_val <- revalue(x= data, replace = bi_dict)
df$Arthritis <- as.numeric(binary_val)
Diabetes
data <- df$Diabetes
binary_val <- revalue(x= data, replace = bi_dict)
df$Diabetes <- as.numeric(binary_val)
Hyperlipidemia
data <- df$Hyperlipidemia
binary_val <- revalue(x= data, replace = bi_dict)
df$Hyperlipidemia <- as.numeric(binary_val)
Back Pain
data <- df$BackPain
binary_val <- revalue(x= data, replace = bi_dict)
df$BackPain <- as.numeric(binary_val)
Allergic rhinitis
data <- df$Allergic_rhinitis
binary_val <- revalue(x= data, replace = bi_dict)
df$Allergic_rhinitis <- as.numeric(binary_val)
Reflux esophagitis
data <- df$Reflux_esophagitis
binary_val <- revalue(x= data, replace = bi_dict)
df$Reflux_esophagitis <- as.numeric(binary_val)
Asthma
data <- df$Asthma
binary_val <- revalue(x= data, replace = bi_dict)
df$Asthma <- as.numeric(binary_val)
Services
unique(df$Services)
[1] "Blood Work" "Intravenous" "CT Scan" "MRI"
data <- df$Services
my_dict <- c(
"Blood Work" = 1,
"Intravenous" = 2,
"CT Scan" = 3,
"MRI" = 4)
risk_val <- revalue(x= data, replace = my_dict)
df$Services <- as.numeric(risk_val)
Imputation of NULL values
summary(df)
City State County Zip Lat Lng
Length:10000 Min. : 1.00 Length:10000 Min. : 610 Min. :17.97 Min. :-174.21
Class :character 1st Qu.:14.00 Class :character 1st Qu.:27592 1st Qu.:35.26 1st Qu.: -97.35
Mode :character Median :26.00 Mode :character Median :50207 Median :39.42 Median : -88.40
Mean :26.84 Mean :50159 Mean :38.75 Mean : -91.24
3rd Qu.:39.00 3rd Qu.:72412 3rd Qu.:42.04 3rd Qu.: -80.44
Max. :52.00 Max. :99929 Max. :70.56 Max. : -65.29
Population Area Timezone Children Age Education
Min. : 0.0 Min. :1.000 Min. :-10.000 Min. : 0.000 Min. :18.0 Min. : 0.00
1st Qu.: 694.8 1st Qu.:1.000 1st Qu.: -6.000 1st Qu.: 0.000 1st Qu.:35.0 1st Qu.:12.00
Median : 2769.0 Median :2.000 Median : -6.000 Median : 1.000 Median :53.0 Median :14.00
Mean : 9965.2 Mean :1.993 Mean : -5.861 Mean : 2.098 Mean :53.3 Mean :13.61
3rd Qu.: 13945.0 3rd Qu.:3.000 3rd Qu.: -5.000 3rd Qu.: 3.000 3rd Qu.:71.0 3rd Qu.:16.00
Max. :122814.0 Max. :3.000 Max. : -4.000 Max. :10.000 Max. :89.0 Max. :24.00
NA's :2588 NA's :2414
Income_household ReAdmis VitD_levels Doc_visits Full_meals_eaten VitD_supp
Min. : 154.1 Min. :0.0000 Min. : 9.519 Min. :1.000 Min. :0.000 Min. :0.0000
1st Qu.: 19450.8 1st Qu.:0.0000 1st Qu.:16.513 1st Qu.:4.000 1st Qu.:0.000 1st Qu.:0.0000
Median : 33942.3 Median :0.0000 Median :18.081 Median :5.000 Median :1.000 Median :0.0000
Mean : 40484.4 Mean :0.3669 Mean :19.413 Mean :5.012 Mean :1.001 Mean :0.3989
3rd Qu.: 54075.2 3rd Qu.:1.0000 3rd Qu.:19.790 3rd Qu.:6.000 3rd Qu.:2.000 3rd Qu.:1.0000
Max. :207249.1 Max. :1.0000 Max. :53.019 Max. :9.000 Max. :7.000 Max. :5.0000
NA's :2464
Soft_drink HighBlood Stroke Complication_risk Overweight Arthritis
Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :1.000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.000 Median :0.0000 Median :2.000 Median :1.0000 Median :0.0000
Mean :0.2581 Mean :0.409 Mean :0.1993 Mean :2.123 Mean :0.7091 Mean :0.3574
3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :3.000 Max. :1.0000 Max. :1.0000
NA's :2467 NA's :982
Diabetes Hyperlipidemia BackPain Anxiety Allergic_rhinitis Reflux_esophagitis
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.2738 Mean :0.3372 Mean :0.4114 Mean :0.3223 Mean :0.3941 Mean :0.4135
3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
NA's :984
Asthma Services Initial_days TotalCharge Additional_charges Survey_TimelyAdmin
Min. :0.0000 Min. :1.000 Min. : 1.002 Min. : 1257 Min. : 3126 Min. :1.000
1st Qu.:0.0000 1st Qu.:1.000 1st Qu.: 7.912 1st Qu.: 3253 1st Qu.: 7986 1st Qu.:3.000
Median :0.0000 Median :1.000 Median :34.447 Median : 5852 Median :11574 Median :4.000
Mean :0.2893 Mean :1.672 Mean :34.432 Mean : 5892 Mean :12935 Mean :3.519
3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:61.125 3rd Qu.: 7615 3rd Qu.:15626 3rd Qu.:4.000
Max. :1.0000 Max. :4.000 Max. :71.981 Max. :21524 Max. :30566 Max. :8.000
NA's :1056
Survey_TimelyTreatment Survey_TimelyVisits Survey_Reliability Survey_Options Survey_HoursTreatment
Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000
Median :3.000 Median :4.000 Median :4.000 Median :3.000 Median :4.000
Mean :3.507 Mean :3.511 Mean :3.515 Mean :3.497 Mean :3.522
3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :7.000 Max. :8.000 Max. :7.000 Max. :7.000 Max. :7.000
Survey_CourteousStaff Survey_ActiveListening Employment_FullTime Employment_PartTime Employment_Retired
Min. :1.000 Min. :1.00 Min. :0.0000 Min. :0.0000 Min. :0.000
1st Qu.:3.000 1st Qu.:3.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
Median :3.000 Median :3.00 Median :1.0000 Median :0.0000 Median :0.000
Mean :3.494 Mean :3.51 Mean :0.6029 Mean :0.0991 Mean :0.098
3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.000
Max. :7.000 Max. :7.00 Max. :1.0000 Max. :1.0000 Max. :1.000
Student Unemployed Female Male Admin_elective Admin_observation
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :1.0000 Median :0.0000 Median :0.0000 Median :0.0000
Mean :0.1017 Mean :0.0983 Mean :0.5018 Mean :0.4768 Mean :0.2504 Mean :0.2436
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
Admin_emergency
Min. :0.000
1st Qu.:0.000
Median :1.000
Mean :0.506
3rd Qu.:1.000
Max. :1.000
NULL values are found in 7 columns :
- Children - Age - Income_household - Soft_drink - Overweight - Anxiety - Initial_days
NULL values of the variables “Children”, “Soft_drink”, and “Anxiety” will be changed to 0. These variables consist of yes/no data, as reported by the patient. It is reasonable to believe that these variables would be left blank by the patient -and therefore recorded as null- if they did not apply to the patient (0).
var <- df$Soft_drink
df$Soft_drink <- replace(var, is.na(var), 0)
var <- df$Anxiety
df$Anxiety <- replace(var, is.na(var), 0)
var <- df$Children
df$Children <- replace(var, is.na(var), 0)
All other NULL values will be filled using MICE
library(mice)
Attaching package: ‘mice’
The following object is masked from ‘package:stats’:
filter
The following objects are masked from ‘package:base’:
cbind, rbind
md.pattern(df)
City State County Zip Lat Lng Population Area Timezone Children Education ReAdmis VitD_levels Doc_visits
4618 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1496 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1481 1 1 1 1 1 1 1 1 1 1 1 1 1 1
467 1 1 1 1 1 1 1 1 1 1 1 1 1 1
535 1 1 1 1 1 1 1 1 1 1 1 1 1 1
187 1 1 1 1 1 1 1 1 1 1 1 1 1 1
165 1 1 1 1 1 1 1 1 1 1 1 1 1 1
69 1 1 1 1 1 1 1 1 1 1 1 1 1 1
509 1 1 1 1 1 1 1 1 1 1 1 1 1 1
166 1 1 1 1 1 1 1 1 1 1 1 1 1 1
153 1 1 1 1 1 1 1 1 1 1 1 1 1 1
54 1 1 1 1 1 1 1 1 1 1 1 1 1 1
53 1 1 1 1 1 1 1 1 1 1 1 1 1 1
22 1 1 1 1 1 1 1 1 1 1 1 1 1 1
22 1 1 1 1 1 1 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0
Full_meals_eaten VitD_supp Soft_drink HighBlood Stroke Complication_risk Arthritis Diabetes
4618 1 1 1 1 1 1 1 1
1496 1 1 1 1 1 1 1 1
1481 1 1 1 1 1 1 1 1
467 1 1 1 1 1 1 1 1
535 1 1 1 1 1 1 1 1
187 1 1 1 1 1 1 1 1
165 1 1 1 1 1 1 1 1
69 1 1 1 1 1 1 1 1
509 1 1 1 1 1 1 1 1
166 1 1 1 1 1 1 1 1
153 1 1 1 1 1 1 1 1
54 1 1 1 1 1 1 1 1
53 1 1 1 1 1 1 1 1
22 1 1 1 1 1 1 1 1
22 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0
Hyperlipidemia BackPain Anxiety Allergic_rhinitis Reflux_esophagitis Asthma Services TotalCharge
4618 1 1 1 1 1 1 1 1
1496 1 1 1 1 1 1 1 1
1481 1 1 1 1 1 1 1 1
467 1 1 1 1 1 1 1 1
535 1 1 1 1 1 1 1 1
187 1 1 1 1 1 1 1 1
165 1 1 1 1 1 1 1 1
69 1 1 1 1 1 1 1 1
509 1 1 1 1 1 1 1 1
166 1 1 1 1 1 1 1 1
153 1 1 1 1 1 1 1 1
54 1 1 1 1 1 1 1 1
53 1 1 1 1 1 1 1 1
22 1 1 1 1 1 1 1 1
22 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0
Additional_charges Survey_TimelyAdmin Survey_TimelyTreatment Survey_TimelyVisits Survey_Reliability
4618 1 1 1 1 1
1496 1 1 1 1 1
1481 1 1 1 1 1
467 1 1 1 1 1
535 1 1 1 1 1
187 1 1 1 1 1
165 1 1 1 1 1
69 1 1 1 1 1
509 1 1 1 1 1
166 1 1 1 1 1
153 1 1 1 1 1
54 1 1 1 1 1
53 1 1 1 1 1
22 1 1 1 1 1
22 1 1 1 1 1
3 1 1 1 1 1
0 0 0 0 0
Survey_Options Survey_HoursTreatment Survey_CourteousStaff Survey_ActiveListening Employment_FullTime
4618 1 1 1 1 1
1496 1 1 1 1 1
1481 1 1 1 1 1
467 1 1 1 1 1
535 1 1 1 1 1
187 1 1 1 1 1
165 1 1 1 1 1
69 1 1 1 1 1
509 1 1 1 1 1
166 1 1 1 1 1
153 1 1 1 1 1
54 1 1 1 1 1
53 1 1 1 1 1
22 1 1 1 1 1
22 1 1 1 1 1
3 1 1 1 1 1
0 0 0 0 0
Employment_PartTime Employment_Retired Student Unemployed Female Male Admin_elective Admin_observation
4618 1 1 1 1 1 1 1 1
1496 1 1 1 1 1 1 1 1
1481 1 1 1 1 1 1 1 1
467 1 1 1 1 1 1 1 1
535 1 1 1 1 1 1 1 1
187 1 1 1 1 1 1 1 1
165 1 1 1 1 1 1 1 1
69 1 1 1 1 1 1 1 1
509 1 1 1 1 1 1 1 1
166 1 1 1 1 1 1 1 1
153 1 1 1 1 1 1 1 1
54 1 1 1 1 1 1 1 1
53 1 1 1 1 1 1 1 1
22 1 1 1 1 1 1 1 1
22 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0
Admin_emergency Overweight Initial_days Age Income_household
4618 1 1 1 1 1 0
1496 1 1 1 1 0 1
1481 1 1 1 0 1 1
467 1 1 1 0 0 2
535 1 1 0 1 1 1
187 1 1 0 1 0 2
165 1 1 0 0 1 2
69 1 1 0 0 0 3
509 1 0 1 1 1 1
166 1 0 1 1 0 2
153 1 0 1 0 1 2
54 1 0 1 0 0 3
53 1 0 0 1 1 2
22 1 0 0 1 0 3
22 1 0 0 0 1 3
3 1 0 0 0 0 4
0 982 1056 2414 2464 6916
mymicecomplete <- complete(mymice, 2)
df<- mymicecomplete
Identifying Outliers
df <- df[,c(14, 1, 3, 2, 4:12, 44, 45, 46, 48, 47, 49, 50, 13, 15, 18, 16, 17, 19:32, 51:53, 33:43)]
head(df)
Population
library(ggplot2)
uni_out <- qplot(data = df, y= Population, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2),
main='Population') +
geom_text(aes(label=ifelse(Population %in% boxplot.stats(Population)$out,
as.character(Zip), "")), hjust = 1.5)
uni_out
After a quick Google search of the populations in the top four zip codes, these populations are accurate, but they are outliers that may restrict the effectiveness of a model, so this variable will be standardized.
df$Population <- scale(x = df$Population)
Children
uni_out <- qplot(data = df, y= Children, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2),
main='Children')
uni_out
Perform a grubbs test on these outliers
library(outliers)
x <- df$Children
grubbs.test(x)
Grubbs test for one outlier
data: x
G = 4.07815, U = 0.99834, p-value = 0.2254
alternative hypothesis: highest value 10 is an outlier
The p-value is above 0.05, these values will remain as-is.Age
uni_out <- qplot(data = df, y= Age, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2),
main='Age')
uni_out
Education
uni_out <- qplot(data = df, y= Education, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2),
main='Education')
uni_out
Extremely low and extremely high levels of education are outliers, so this variable will be standardized, so as to reduce any negative impact to the model.
df$Education <- scale(x = df$Education)
Income_household
uni_out <- qplot(data = df, y= Income_household, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2),
main='Income_household')
uni_out
x <- df$Income_household
grubbs.test(x)
Grubbs test for one outlier
data: x
G = 5.79209, U = 0.99664, p-value = 3.379e-05
alternative hypothesis: highest value 207249.13 is an outlier
These values differ from the mean significantly- they will be standardized.
df$Income_household <- scale(x = df$Income_household)
VitD_levels
uni_out <- qplot(data = df, y= VitD_levels, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2),
main='VitD_levels')
uni_out
Check for correlation between VitD_levels and VitD_supp
plot(df$VitD_levels, df$VitD_supp,
main ='Vitamin D Levels and Supplements',
xlab ='Levels',
ylab = 'Supplements Administered')
There seem to be two separate clusters of data, so the cluster consisting of patients with high vitamin D levels that also received vitamin D supplements will be checked against potential reasons for the supplementation of vitamin D for patients with seemingly-normal levels of vitamin D.
high_VitD <- which(df$VitD_levels > 30 & df$VitD_supp>1)
out <- df[high_VitD,] ; out
plot(out$Age, out$Overweight)
Doc_visits
uni_out <- qplot(data = df, y= Doc_visits, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2),
main='Doc_visits') +
geom_text(aes(label=ifelse(Doc_visits %in% boxplot.stats(Doc_visits)$out,
as.character(ReAdmis), "")), hjust = 1.5)
uni_out
Full Meals Eaten
uni_out <- qplot(data = df, y= Full_meals_eaten, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2),
main='Full_meals_eaten')
uni_out
Hypothesis test
x <- df$Full_meals_eaten
grubbs.test(x)
Grubbs test for one outlier
data: x
G = 5.95030, U = 0.99646, p-value = 1.297e-05
alternative hypothesis: highest value 7 is an outlier
Standardize values
df$Full_meals_eaten <- scale(x = df$Full_meals_eaten)
Complication_risk
uni_out <- qplot(data = df, y= Complication_risk, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2),
main='Complication_risk') +
geom_text(aes(label=ifelse(Complication_risk %in% boxplot.stats(Complication_risk)$out,
as.character(ReAdmis), "")), hjust = 1.5)
uni_out
There are no outliers within this variable.
Initial_days
uni_out <- qplot(data = df, y= Initial_days, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2),
main='Initial_days') +
geom_text(aes(label=ifelse(Initial_days %in% boxplot.stats(Initial_days)$out,
as.character(Initial_days), "")), hjust = 1.5)
uni_out
There are no outliers within this variable
TotalCharge
uni_out <- qplot(data = df, y= TotalCharge, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2),
main='TotalCharge') +
geom_text(aes(label=ifelse(TotalCharge %in% boxplot.stats(TotalCharge)$out,
as.character(ReAdmis), "")), hjust = 1.5)
uni_out
Additional_charges
uni_out <- qplot(data = df, y= Additional_charges, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2),
main='Additional_charges') +
geom_text(aes(label=ifelse(Additional_charges %in% boxplot.stats(Additional_charges)$out,
as.character(ReAdmis), "")), hjust = 1.5)
uni_out
Standardize variable
df$Additional_charges <- scale(x = df$Additional_charges)
Survey Results
survey_results <- df[,46:53]
for (col in survey_results){
uni_out <- qplot(data = survey_results, y= col, x=1,
geom='boxplot',
outlier.color='deeppink2',
xlim=c(0,2))
print(uni_out)
}
for (col in survey_results){
x <- col
print(grubbs.test(x))
}
Grubbs test for one outlier
data: x
G = 4.34239, U = 0.99811, p-value = 0.06985
alternative hypothesis: highest value 8 is an outlier
Grubbs test for one outlier
data: x
G = 3.37574, U = 0.99886, p-value = 1
alternative hypothesis: highest value 7 is an outlier
Grubbs test for one outlier
data: x
G = 4.34653, U = 0.99811, p-value = 0.06854
alternative hypothesis: highest value 8 is an outlier
Grubbs test for one outlier
data: x
G = 3.36289, U = 0.99887, p-value = 1
alternative hypothesis: highest value 7 is an outlier
Grubbs test for one outlier
data: x
G = 3.40043, U = 0.99884, p-value = 1
alternative hypothesis: highest value 7 is an outlier
Grubbs test for one outlier
data: x
G = 3.36844, U = 0.99887, p-value = 1
alternative hypothesis: highest value 7 is an outlier
Grubbs test for one outlier
data: x
G = 3.43253, U = 0.99882, p-value = 1
alternative hypothesis: highest value 7 is an outlier
Grubbs test for one outlier
data: x
G = 3.34861, U = 0.99888, p-value = 1
alternative hypothesis: highest value 7 is an outlier
These values will remain.
Data Cleaning Strengths & Weaknesses
The strengths of this clean data are that nearly all values are numeric, allowing easy calculations and visualizations, and there are no null values remaining. The weaknesses are that not all new numeric values are not easily interpreted without a data dictionary.
The major limitation when cleaning the data set was not having access to the individuals that collected the data. Many more decisions could have been made had the variables been better explained. For example, VitD_supp is described as the “number of times” a patient is administered a vitamin D supplement. It was not clarified if this number was daily, over the course of the entire initial stay, or any specific dosage. This may affect the analysis in that results may be skewed in unanticipated directions due to unbalanced data.
Principle Component Analysis
(ref: http://factominer.free.fr/factomethods/principal-components-analysis.html)
head(df)
I will not use the target variable (ReAdmis), the qualititative variables (City and County), or the redundant variables (State, Lat, Lng, Timezone) Standardization
df_sub <- scale(x = df[,c(5, 8, 9, 11:53)])
head(df_sub)
Zip Population Area Children Age Education Employment_FullTime
[1,] -0.5292516 -0.4731446 0.008079945 -0.2681162 -0.02526499 -0.1970790 0.8115319
[2,] -0.6448340 0.0902373 1.232313962 0.6977202 -0.12221431 0.1264787 0.8115319
[3,] 0.2530317 0.4829587 0.008079945 0.6977202 -0.02526499 0.1264787 -1.2321143
[4,] 0.2152444 -0.5263663 0.008079945 -0.7510343 1.18660144 -0.5206366 -1.2321143
[5,] -0.9821161 -0.3155703 -1.216154073 -0.7510343 -1.52797937 -0.5206366 0.8115319
[6,] 0.8832923 -0.6060304 1.232313962 -0.7510343 1.08965213 -0.5206366 -1.2321143
Employment_PartTime Employment_Retired Unemployed Student Female Male Income_household
[1,] -0.3316476 -0.3296006 -0.3301597 -0.3364558 -1.0035563 1.0474759 1.59418085
[2,] -0.3316476 -0.3296006 -0.3301597 -0.3364558 0.9963566 -0.9545805 0.21068775
[3,] -0.3316476 3.0336712 -0.3301597 -0.3364558 0.9963566 -0.9545805 -0.91767138
[4,] -0.3316476 3.0336712 -0.3301597 -0.3364558 -1.0035563 1.0474759 -0.03506789
[5,] -0.3316476 -0.3296006 -0.3301597 -0.3364558 0.9963566 -0.9545805 -1.37549383
[6,] -0.3316476 3.0336712 -0.3301597 -0.3364558 -1.0035563 1.0474759 -0.95951123
VitD_levels VitD_supp Doc_visits Full_meals_eaten Soft_drink HighBlood Stroke Complication_risk
[1,] -0.23951785 -0.6346809 0.94459928 -0.993337188 -0.4912094 1.2020163 -0.4988811 -0.1688644
[2,] -0.06217740 0.9563968 -0.96793217 0.990559733 -0.4912094 1.2020163 -0.4988811 1.2006768
[3,] -0.29699603 -0.6346809 -0.96793217 -0.001388728 -0.4912094 1.2020163 -0.4988811 -0.1688644
[4,] -0.29637274 -0.6346809 -0.96793217 -0.001388728 -0.4912094 -0.8318523 2.0042853 -0.1688644
[5,] -0.37811197 2.5474746 -0.01166644 -0.993337188 2.0355880 -0.8318523 -0.4988811 -1.5384057
[6,] 0.08083369 -0.6346809 0.94459928 -0.993337188 -0.4912094 -0.8318523 -0.4988811 -0.1688644
Overweight Arthritis Diabetes Hyperlipidemia BackPain Anxiety Allergic_rhinitis
[1,] -1.5900719 1.3408228 1.6285072 -0.713232 1.1960691 1.5623419 1.2398683
[2,] 0.6288395 -0.7457362 -0.6139979 -0.713232 -0.8359885 -0.6400008 -0.8064566
[3,] 0.6288395 -0.7457362 1.6285072 -0.713232 -0.8359885 -0.6400008 -0.8064566
[4,] -1.5900719 1.3408228 -0.6139979 -0.713232 -0.8359885 -0.6400008 -0.8064566
[5,] -1.5900719 -0.7457362 -0.6139979 1.401928 -0.8359885 -0.6400008 1.2398683
[6,] 0.6288395 1.3408228 1.6285072 -0.713232 1.1960691 -0.6400008 1.2398683
Reflux_esophagitis Asthma Services Admin_elective Admin_observation Admin_emergency Initial_days
[1,] -0.8396186 1.5672823 -0.8069574 -0.5779372 -0.5674677 0.9880217 -0.9073416
[2,] 1.1908979 -0.6379833 0.3938721 -0.5779372 -0.5674677 0.9880217 -0.7347159
[3,] -0.8396186 -0.6379833 -0.8069574 1.7301187 -0.5674677 -1.0120223 -1.1282089
[4,] 1.1908979 1.5672823 -0.8069574 1.7301187 -0.5674677 -1.0120223 -1.2443604
[5,] -0.8396186 -0.6379833 1.5947016 1.7301187 -0.5674677 -1.0120223 -1.2618392
[6,] -0.8396186 -0.6379833 -0.8069574 -0.5779372 1.7620385 -1.0120223 -1.0831862
TotalCharge Additional_charges Survey_TimelyAdmin Survey_TimelyTreatment Survey_TimelyVisits
[1,] -0.7995390 0.764967085 -0.5027299 -0.4896481 -1.4631734
[2,] -0.4964039 0.715077864 -0.5027299 0.4766991 -0.4948898
[3,] -1.0995966 0.698600372 -1.4717544 0.4766991 0.4733939
[4,] -1.0144664 0.009003875 -0.5027299 1.4430463 1.4416775
[5,] -1.1860294 -1.408920097 -1.4717544 -2.4223426 -0.4948898
[6,] -0.9229888 -0.029336751 0.4662946 1.4430463 0.4733939
Survey_Reliability Survey_Options Survey_HoursTreatment Survey_CourteousStaff Survey_ActiveListening
[1,] -1.4620544 0.4883553 -0.5061140 -0.4836475 0.4703965
[2,] 0.4679230 0.4883553 0.4625253 -0.4836475 -0.4890090
[3,] 0.4679230 -0.4823371 0.4625253 -0.4836475 -0.4890090
[4,] -0.4970657 0.4883553 1.4311645 1.4744395 1.4298020
[5,] -0.4970657 1.4590477 -0.5061140 0.4953960 -0.4890090
[6,] 0.4679230 -0.4823371 1.4311645 0.4953960 2.3892076
Principal Component Analysis
library(FactoMineR)
df_sub.pca <- PCA(df_sub, scale.unit=TRUE, graph=F)
Scree Plot
eig.val <- df_sub.pca$eig
barplot(eig.val[, 2],
main = "Explained Variance (%)",
xlab = "Principal Components",
ylab = "(%) of Variance",
col = "darkblue")
Graph of Individuals
plot(df_sub.pca, choix = "ind", autoLab = "auto", habillage = "cos2", col.ind="darkblue", label="var", graph.type = "ggplot")
Graph of Variables
plot(df_sub.pca, choix = "var", autoLab = "auto", col.var="darkblue", label="var", graph.type = "ggplot", select="cos2 0.40")
plot(df_sub.pca, choix = "ind", autoLab = "auto", habillage = "TotalCharge", col.ind="darkblue", label="var", graph.type = "ggplot")
plot(df_sub.pca, choix = "ind", autoLab = "auto", habillage = "Additional_charges", col.ind="darkblue", label="var", graph.type = "ggplot")
plot(df_sub.pca, choix = "ind", autoLab = "auto", habillage = "Survey_HoursTreatment", col.ind="darkblue", label="var", graph.type = "ggplot")
plot(df_sub.pca, choix = "ind", autoLab = "auto", habillage = "Survey_TimelyVisits", col.ind="darkblue", label="var", graph.type = "ggplot")
plot(df_sub.pca, choix = "ind", autoLab = "auto", habillage = "Survey_TimelyAdmin", col.ind="darkblue", label="var", graph.type = "ggplot")
plot(df_sub.pca, choix = "ind", autoLab = "auto", habillage = "Survey_TimelyTreatment", col.ind="darkblue", label="var", graph.type = "ggplot")
In order to determine the principal components, I ran the PCA using the package FactoMineR and viewed the Scree plot. The PCA graphs of individuals and variables allowed me to reduce the data to the most important variables in terms of the original vector of variables. The variables with cos2 over 0.4 were those that I identified as “most important,” and they were determined to be
- TotalCharge - Additional_charges - Survey_HoursTreatment - Survey_TimelyVisits - Survey_TimelyAdmin - Survey_TimelyTreatment
Hospitals can benefit from researching these components and analyzing the correlation between these principal components and readmission rates. Pending results of an in-depth analysis, the organization can obtain actionable insights and attempt to reduce readmission rates.