library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.5     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(skimr)
library(data.table)
## 
## Attaching package: 'data.table'
## The following object is masked from 'package:purrr':
## 
##     transpose
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(mltools)
## 
## Attaching package: 'mltools'
## The following object is masked from 'package:tidyr':
## 
##     replace_na
library(corrplot)
## corrplot 0.92 loaded
library(ROCR)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(DMwR2)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(Rtsne)
library(doParallel)
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: iterators
## Loading required package: parallel
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(stringr)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
## 
##     compact
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:mltools':
## 
##     skewness
library(Matrix)
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice

Introduction to Data

The data set describes hospital readmissions among diabetic patients. About 50 variables are included that cover characteristics of readmitted and not-readmitted patients. Most variables describe patient characteristics, medical conditions, or features of medical treatments, while other variables provide measures of care quality and condition. The data types are varied and include discrete, continuous, and categorical (both nominal and ordinal) data.

The data was originally published by the UCI Machine Learning Repository. There are 101,766 observations, and no column contains an explicitly coded NA value; missing entries are instead recorded with the "?" placeholder. I discuss the data characteristics, statistics, and description below.
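
As a minimal loading sketch (the file name diabetic_data.csv and the local path are assumptions about the setup), the "?" placeholder can be recoded to NA at read time:

raw_data <- read.csv("diabetic_data.csv",
                     na.strings = c("?", ""),   # recode the "?" placeholder as NA
                     stringsAsFactors = FALSE)
dim(raw_data)  # expected: 101766 rows, 50 columns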

Data Description and Type:

The diabetes dataset we use consists of 101,766 records and 50 features. Below is a detailed description of 10 of the important features in the dataset.

  1. Encounter ID: TYPE: Continuous. DESC: a unique identifier for each claim record.

  2. Race: TYPE: Categorical. DESC: African-American, Asian, Caucasian, Hispanic, Other.

  3. Gender: TYPE: Categorical. DESC: Male, Female, Unknown.

  4. Age: TYPE: Categorical. DESC: [0-10), [10-20), [20-30), [30-40), [40-50), [50-60), [60-70), [70-80), [80-90), [90-100).

  5. Weight: TYPE: Continuous. DESC: weight of the patient.

  6. Admission type ID: TYPE: Categorical. DESC: Elective, Emergency, Newborn, Not Available, Not Mapped, Urgent.

  7. Discharge disposition ID: TYPE: Categorical. DESC: e.g., Discharged to home, Expired, Not Mapped.

  8. Admission source ID: TYPE: Categorical. DESC: Clinic Referral, Court/Law Enforcement, Emergency Room, HMO Referral.

  9. Time in hospital: TYPE: Numerical. DESC: time (days) the patient stayed in the hospital.

  10. Readmitted: TYPE: Categorical. DESC: FALSE, TRUE.

Summary Statistics and Description

The current data set is composed of 101,766 records and 50 features.

  • Read directly, the 101,766 observations show no NA values in any column; the true missingness only becomes visible once the "?" placeholder is recoded to NA.

  • Looking at the missing percentage of each variable, "weight" (0.96), "medical_specialty" (0.48), and "payer_code" (0.43) have the highest missing percentages in the data; the remaining variables have missing percentages below 3%.

  • The kurtosis of each variable also confirmed that "metformin.pioglitazone" and "glimepiride.pioglitazone" are extremely skewed, with kurtosis = 101,759.

  • "encounter_id" has a median of 152,388,987 and a mean of 165,201,645, but its range is reported as 12,522 to 867,222.

  • Besides a few features, the skewness of the remaining variables is acceptable. The next batch of variables with relatively high kurtosis is number_emergency (22.86), chlorpropamide (37.86), and tolazamide (53.88).

Data Analysis

Histogram Plots

## Warning in melt(raw_data): The melt generic in data.table has been passed
## a data.frame and will attempt to redirect to the relevant reshape2 method;
## please note that reshape2 is deprecated, and this redirection is now
## deprecated as well. To continue using melt methods from reshape2 while both
## libraries are attached, e.g. melt.list, you can prepend the namespace like
## reshape2::melt(raw_data). In the next version, this warning will become an
## error.

Missing Percentage

Let's find out the missing-value percentage of each column in the data.

## Warning in plot.aggr(res, ...): not enough vertical space to display frequencies
## (too many combinations)

## 
##  Variables sorted by number of missings: 
##                   Variable        Count
##                     weight 0.9601079449
##          medical_specialty 0.4820744428
##                 payer_code 0.4340585587
##                       race 0.0272378982
##                     diag_3 0.0171285550
##                     diag_2 0.0041108532
##                     diag_1 0.0001538074
##               encounter_id 0.0000000000
##                patient_nbr 0.0000000000
##                     gender 0.0000000000
##                        age 0.0000000000
##          admission_type_id 0.0000000000
##   discharge_disposition_id 0.0000000000
##        admission_source_id 0.0000000000
##           time_in_hospital 0.0000000000
##         num_lab_procedures 0.0000000000
##             num_procedures 0.0000000000
##            num_medications 0.0000000000
##          number_outpatient 0.0000000000
##           number_emergency 0.0000000000
##           number_inpatient 0.0000000000
##           number_diagnoses 0.0000000000
##              max_glu_serum 0.0000000000
##                  A1Cresult 0.0000000000
##                  metformin 0.0000000000
##                repaglinide 0.0000000000
##                nateglinide 0.0000000000
##             chlorpropamide 0.0000000000
##                glimepiride 0.0000000000
##              acetohexamide 0.0000000000
##                  glipizide 0.0000000000
##                  glyburide 0.0000000000
##                tolbutamide 0.0000000000
##               pioglitazone 0.0000000000
##              rosiglitazone 0.0000000000
##                   acarbose 0.0000000000
##                   miglitol 0.0000000000
##               troglitazone 0.0000000000
##                 tolazamide 0.0000000000
##                    examide 0.0000000000
##                citoglipton 0.0000000000
##                    insulin 0.0000000000
##        glyburide.metformin 0.0000000000
##        glipizide.metformin 0.0000000000
##   glimepiride.pioglitazone 0.0000000000
##    metformin.rosiglitazone 0.0000000000
##     metformin.pioglitazone 0.0000000000
##                     change 0.0000000000
##                diabetesMed 0.0000000000
##                 readmitted 0.0000000000

According to the missing-percentage results above, "weight" (0.96), "medical_specialty" (0.48), and "payer_code" (0.43) have the highest missing percentages in the data.
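
An equivalent check without VIM is a one-liner in base R; this sketch just computes and sorts the per-column NA share:

missing_pct <- sort(colMeans(is.na(raw_data)), decreasing = TRUE)
round(head(missing_pct, 5), 4)  # weight, medical_specialty, payer_code, ...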

Data Engineering

Cleaning

Dropping unnecessary or low-value variables, and variables with too many missing values. I dropped "encounter_id", "patient_nbr", "weight", "payer_code", "medical_specialty", "diabetesMed", "diag_1", "diag_2", and "diag_3" because of high missing rates or low predictive value. In addition, any patient whose discharge status is "expired" is dropped. I also removed records with unknown gender, because the missing values in the "gender" column cannot be imputed. Finally, I drop columns whose values barely change (near-zero variance).
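
A sketch of these cleaning steps is below. The expired discharge codes (11, 19, 20, 21) and the "Unknown/Invalid" gender label are assumptions about the ID mapping that ships with the UCI data set:

drop_cols <- c("encounter_id", "patient_nbr", "weight", "payer_code",
               "medical_specialty", "diabetesMed", "diag_1", "diag_2", "diag_3")
raw_data <- raw_data %>%
  select(-all_of(drop_cols)) %>%
  filter(!discharge_disposition_id %in% c(11, 19, 20, 21),  # expired patients
         gender != "Unknown/Invalid")                       # cannot be imputed
# drop near-zero-variance columns with caret
nzv <- nearZeroVar(raw_data)
if (length(nzv) > 0) raw_data <- raw_data[, -nzv]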

Encoding

For the encoding process, I used the following mappings (a code sketch follows the list):

  • Age: replace each range with its midpoint, e.g. [0-10) = 5, [10-20) = 15, etc.

  • Medication change: no change = 0, change = 1

  • Gender: Female = 0, Male = 1

  • Race: Caucasian = 0, African American = 1, Other = 2

  • Insulin dosage: no insulin = 0; decrease in insulin = -1; steady insulin = 1; increase in insulin = 2

  • rosiglitazone: No == 0; Steady == 1

  • pioglitazone: No == 0; Steady == 1

  • glyburide: No == 0; Steady == 1

  • glipizide: No == 0; Steady == 1

  • metformin: No == 0; Steady == 1

  • A1C results: None == 0; normal == 1; abnormal (>7 or >8) == 2

  • Target (readmitted): no readmission within 30 days == 0; readmission in <30 days == 1
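
A sketch of these encodings with dplyr (namespaced, since plyr is attached after dplyr); the source labels such as "[0-10)", "Down", "Up", and "<30" follow the UCI coding and are assumptions to verify against the raw file:

raw_data <- raw_data %>%
  dplyr::mutate(
    age        = dplyr::recode(age, "[0-10)" = 5, "[10-20)" = 15, "[20-30)" = 25,
                               "[30-40)" = 35, "[40-50)" = 45, "[50-60)" = 55,
                               "[60-70)" = 65, "[70-80)" = 75, "[80-90)" = 85,
                               "[90-100)" = 95),
    change     = ifelse(change == "No", 0, 1),
    gender     = ifelse(gender == "Male", 1, 0),
    race       = dplyr::case_when(race == "Caucasian" ~ 0,
                                  race == "AfricanAmerican" ~ 1,
                                  TRUE ~ 2),
    insulin    = dplyr::case_when(insulin == "No" ~ 0, insulin == "Down" ~ -1,
                                  insulin == "Steady" ~ 1, insulin == "Up" ~ 2),
    readmitted = ifelse(readmitted == "<30", 1, 0)
  )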

Correlations

Below are the top 10 explanatory variables most positively correlated with the target (the first row is the target's correlation with itself):

##    readmitted
## 1  1.00000000
## 2  0.10106864
## 3  0.06570063
## 4  0.05469331
## 5  0.04554967
## 6  0.04492613
## 7  0.03741080
## 8  0.03192364
## 9  0.02807725
## 10 0.01490265
##  [1] "num_medications"          "change"                  
##  [3] "time_in_hospital"         "number_diagnoses"        
##  [5] "num_lab_procedures"       "age"                     
##  [7] "num_procedures"           "metformin"               
##  [9] "A1Cresult"                "admission_type_id"       
## [11] "glyburide"                "number_inpatient"        
## [13] "discharge_disposition_id" "glipizide"               
## [15] "race"                     "admission_source_id"     
## [17] "number_emergency"         "pioglitazone"            
## [19] "insulin"                  "gender"

Data Visualization

The distribution of the target variable, readmitted, is shown in the histogram here. We can see that it is a binary categorical variable with no gaps and no apparent pattern of missing values. There are over 60,000 not-readmitted records (0) and over 1,000 readmitted records (1).

Target Variable (readmitted)

Predictors

  • Visualization of the histogram of each individual predictor variable indicates that, besides the numerical variables, there are many categorical (discrete) variables, such as number_inpatient, number_emergency, etc.

  • The obvious discrete variables are listed below; each of them has no more than 10-12 unique values:

    • Age,

    • admission_type_id,

    • time_in_hospital,

    • num_medications,

    • num_lab_procedures

  • There are some bimodal variables:

    • number_diagnoses
  • The histogram also indicates the right skewness of age, which has a spike of counts at around 70. The histogram also indicates the left skewness of time_in_hospital, which has a spike of counts at around 10.

  • We chose bins = 15 and facet_wrap() for the histograms; a sketch is shown below. These findings are preserved after changing the number of bins.
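
A sketch of those faceted histograms, using tidyr::pivot_longer instead of the deprecated reshape2::melt redirection warned about earlier:

raw_data %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 15) +                 # bins = 15, as described above
  facet_wrap(~ variable, scales = "free")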

Outliers Analysis with Boxplot

Because some of the variables are skewed, the box plots flag many points in these predictors as outliers. These variables include:

  • A1Cresult,

  • number_emergency,

  • num_medications,

  • time_in_hospital

  • num_lab_procedures

Relationships Between the Target and Explanatory Variables

The plot below indicates the relationship between the target and the explanatory variables.

Among all 23 predictor variables, the majority have a clear association with the target variable (readmitted). Perhaps half of these predictors, where they are numerical and continuous, relate to the target in a roughly linear fashion. Interpreted from a common-sense perspective, their association with readmission is plausible: these are the predictors with good, continuous measurement, often coming from the care environment rather than a manually controlled source. It is therefore not surprising that they show good linearity with the target (readmitted). That is the good news from the predictors, and it favors a linear model as well as tree-based models. However, many other variables, even though they are linear, numerical predictors, either have outliers or are not measured continuously enough (their patterns are interrupted), and so they may produce errors if I fit a linear model to the outcome directly, without tuning these variables or using more sophisticated modeling.

Finally, many of the predictor variables are discrete or nominal, with fewer than 10 (or even 3) levels. So when we fit these variables into the model, we have to be extremely careful: the levels of such predictors can be overly simplified in terms of explanation, due to the overly crude way they describe the nature of the variable.

Data Visualization

Age

Box plot

Density Distribution

Admission type ID

Box plot

Density Distribution

Discharge disposition ID

Box plot

Density Distribution

Time in Hospital

Box plot

Density Distribution

Admission Source ID

Box plot

Density Distribution

Number of Labs procedures

Box plot of Number

Density Distribution

Modeling Data Pre-Processing

Splitting Data Set

Splitting dataset into training and test sets.

set.seed(123)
training.samples <- raw_data$readmitted %>%
createDataPartition(p = 0.8, list = FALSE)
train_data  <- raw_data[training.samples, ]
test_data <- raw_data[-training.samples, ]

I used the 80/20 rule to create the training data set and the testing data set. The createDataPartition() function is used for that purpose; it draws a random sample from the complete data, stratified by the outcome.
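
Because createDataPartition() samples within each level of the outcome, the class balance of readmitted should be approximately preserved in both splits; a quick check:

prop.table(table(train_data$readmitted))
prop.table(table(test_data$readmitted))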

Model Building & Evaluation

Model Performance Estimator

models_test_evaluation <- data.frame()
estimate_model_performance <- function(y_true, y_pred, model_name){
  cm <- confusionMatrix(table(y_true, y_pred))
  cm_table <- cm$table
  # table(y_true, y_pred) is stored column-major, so with class 1
  # (readmitted) as the positive class:
  # [[1]] = TN, [[2]] = FN, [[3]] = FP, [[4]] = TP
  tpr <- cm_table[[4]] / (cm_table[[4]] + cm_table[[2]])
  fnr <- 1 - tpr
  fpr <- cm_table[[3]] / (cm_table[[3]] + cm_table[[1]])
  tnr <- 1 - fpr
  accuracy <- cm$overall[[1]]
  # ROCR expects numeric predictions, so coerce a factor back to 0/1
  for_auc <- prediction(as.numeric(as.character(y_pred)), y_true)
  auc <- performance(for_auc, "auc")@y.values[[1]]
  return(data.frame(Algo = model_name, AUC = auc, ACCURACY = accuracy, TPR = tpr, FPR = fpr, TNR = tnr, FNR = fnr))
}

Now the data has been evaluated, with missing values imputed. The data is ready for the various modeling efforts.

First, before any modeling occurred, I created an empty data frame called models_test_evaluation, which is the placeholder for all the model evaluations. For each model, I record AUC, accuracy, and the true/false positive and negative rates (TPR, FPR, TNR, FNR) computed by the function above. Once these evaluators are available from a model run, they are appended to this data frame, one row per model.
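
Appending a row for a model looks like the following sketch, where preds is a hypothetical vector of class predictions on the test set:

# preds <- predict(some_model, newdata = test_data)   # hypothetical model
# res <- estimate_model_performance(test_data$readmitted, preds, "some model")
# models_test_evaluation <- rbind(models_test_evaluation, res)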

I first run the traditional linear regression model as a baseline: the predictors are mostly numerical, and the binary 0/1 outcome can be treated numerically (a linear probability model).

Next I apply a few tree-based and rule-based models, which are more modern and utilize the predictor variables in an ensemble (bagged) mechanism, rather than assuming a linear relationship to the outcome for every predictor individually, which, as we know, is a very strict assumption that our data does not fully support.

Most of the variables are associated with the outcome, but not in a linear fashion.

Linear Regression Model

set.seed(123)
log_model <- lm(readmitted~.,data = train_data)
summary(log_model)
## 
## Call:
## lm(formula = readmitted ~ ., data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63703 -0.09788 -0.07695 -0.05978  0.99010 
## 
## Coefficients:
##                             Estimate  Std. Error t value             Pr(>|t|)
## (Intercept)               0.00363332  0.00734750   0.494              0.62096
## race                     -0.00041812  0.00228234  -0.183              0.85464
## gender                    0.00083050  0.00244298   0.340              0.73389
## age                       0.00047030  0.00008156   5.766        0.00000000816
## admission_type_id        -0.00135000  0.00083990  -1.607              0.10799
## discharge_disposition_id  0.00325101  0.00023631  13.758 < 0.0000000000000002
## admission_source_id      -0.00008385  0.00029906  -0.280              0.77919
## time_in_hospital          0.00158973  0.00048905   3.251              0.00115
## num_lab_procedures        0.00010665  0.00006850   1.557              0.11951
## num_procedures           -0.00136040  0.00077554  -1.754              0.07941
## num_medications           0.00037324  0.00018961   1.969              0.04901
## number_outpatient        -0.00059208  0.00111355  -0.532              0.59494
## number_emergency          0.00964340  0.00229663   4.199        0.00002686032
## number_inpatient          0.04554726  0.00202587  22.483 < 0.0000000000000002
## number_diagnoses          0.00281262  0.00066790   4.211        0.00002545088
## A1Cresult                -0.00188585  0.00185519  -1.017              0.30938
## metformin                -0.00783355  0.00321014  -2.440              0.01468
## glipizide                 0.00449001  0.00376110   1.194              0.23256
## glyburide                 0.00033572  0.00402747   0.083              0.93357
## pioglitazone             -0.00922421  0.00472671  -1.952              0.05100
## rosiglitazone            -0.00239546  0.00500115  -0.479              0.63195
## insulin                   0.00132763  0.00153234   0.866              0.38627
## change                    0.00874031  0.00294443   2.968              0.00299
##                             
## (Intercept)                 
## race                        
## gender                      
## age                      ***
## admission_type_id           
## discharge_disposition_id ***
## admission_source_id         
## time_in_hospital         ** 
## num_lab_procedures          
## num_procedures           .  
## num_medications          *  
## number_outpatient           
## number_emergency         ***
## number_inpatient         ***
## number_diagnoses         ***
## A1Cresult                   
## metformin                *  
## glipizide                   
## glyburide                   
## pioglitazone             .  
## rosiglitazone               
## insulin                     
## change                   ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2828 on 54787 degrees of freedom
## Multiple R-squared:  0.01841,    Adjusted R-squared:  0.01802 
## F-statistic: 46.71 on 22 and 54787 DF,  p-value: < 0.00000000000000022
plot(log_model)

First, we run the linear regression model. This is our basic benchmark model.

Linear regression is a traditional model, our data contains mostly numerical continuous variables, and our outcome, readmitted, is a binary variable. We therefore chose the linear model as the basic machine-learning technique for predicting readmitted values.

We used the lm() function for the linear regression model. All variables are fitted directly into the model as defined in the original data, with missing values filled in.
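
Because lm() returns a numeric score rather than a class, turning the fit into 0/1 predictions needs a cutoff; a sketch (the 0.5 threshold is an assumption, not part of the original analysis):

lm_pred  <- predict(log_model, newdata = test_data)   # numeric scores
lm_class <- factor(as.integer(lm_pred >= 0.5), levels = c(0, 1))
# estimate_model_performance(test_data$readmitted, lm_class, "LM")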

The overall F statistic is 46.71 on 22 and 54,787 degrees of freedom (the training data contains 54,810 observations; subtracting the 22 predictors and the intercept leaves 54,787). The overall model has a highly significant p-value, but we have to be very careful: overfitting could be the culprit behind this p-value.

Examining the Student's t statistics and their associated p-values, age, discharge_disposition_id, number_emergency, number_inpatient, and number_diagnoses are highly significant; time_in_hospital and change are significant at the 0.01 level, and num_medications and metformin at the 0.05 level.

The next few models all allow a non-linear relationship; they are more popular machine-learning algorithms and more faithful to this prediction task. We chose a few tree-based models.

The ensemble techniques behind the nonlinear models have a few advantages. By bagging trees built on resampled data, the variance of the predictions is reduced, which accommodates even unstable predictors under less stringent assumptions than the linear model.

Random Forest

# trainControl to 10 folds cross validation
set.seed(123)
train_data$readmitted = as.factor(train_data$readmitted)
test_data$readmitted = as.factor(test_data$readmitted)
trcontrol = trainControl("cv", number = 10, savePredictions=FALSE,  index = createFolds(train_data$readmitted, 10), verboseIter = FALSE)
rf_model <- train(readmitted ~., 
                 data = train_data,
                 method = "rf", 
                 trControl = trcontrol)
# Summary Model
summary(rf_model)
##                 Length Class      Mode     
## call                 4 -none-     call     
## type                 1 -none-     character
## predicted        54810 factor     numeric  
## err.rate          1500 -none-     numeric  
## confusion            6 -none-     numeric  
## votes           109620 matrix     numeric  
## oob.times        54810 -none-     numeric  
## classes              2 -none-     character
## importance          22 -none-     numeric  
## importanceSD         0 -none-     NULL     
## localImportance      0 -none-     NULL     
## proximity            0 -none-     NULL     
## ntree                1 -none-     numeric  
## mtry                 1 -none-     numeric  
## forest              14 -none-     list     
## y                54810 factor     numeric  
## test                 0 -none-     NULL     
## inbag                0 -none-     NULL     
## xNames              22 -none-     character
## problemType          1 -none-     character
## tuneValue            1 data.frame list     
## obsLevels            2 -none-     character
## param                0 -none-     list
plot(rf_model)

# Variable feature importance plot
varImp(rf_model)$importance %>%
  tibble::rownames_to_column("Variable") %>%
  ggplot(aes(x = reorder(Variable, Overall), y = Overall)) +
  geom_col() +
  coord_flip()

# Make predictions
#rf_pred <- predict(rf_model, newdata = test_data)
#rf_pred_class<-unlist(apply(round(rf_pred),1,which.max))-1
#rf_table<-table(test_data$readmitted, rf_pred_class)
#base_metric_rf<-caret::confusionMatrix(rf_table)
#base_metric_rf

# Model performance metrics
#post_rst<-postResample(obs = test_data$readmitted, pred=rf_pred)
#models_test_evaluation <- data.frame(t(post_rst)) %>% 
#    mutate(Model = "Random Forest") %>% rbind(models_test_evaluation)
#base_metric_rf_table_standalone<-estimate_model_performance(test_data$readmitted,rf_pred_class,'RF')
#base_metric_rf_table_standalone

Random forest goes one step further than the bagged tree model: beyond bootstrapping the samples, it also adds randomness to the tree-construction process by considering only a random subset of predictors at each split. This decorrelates the trees, hence the name random forest.

As with the other models, we specified 10-fold cross-validation and caret's default tuning grid, and we export the model evaluators for future comparisons.
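
A working version of the commented-out evaluation above might look like this sketch; predict() on a caret model returns class labels by default:

rf_pred <- predict(rf_model, newdata = test_data)
models_test_evaluation <- rbind(models_test_evaluation,
  estimate_model_performance(test_data$readmitted, rf_pred, "Random Forest"))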

NaiveBayes Model

train_data$readmitted = as.factor(train_data$readmitted)
test_data$readmitted = as.factor(test_data$readmitted)
nb_model <- naiveBayes(readmitted ~ ., data = train_data)

summary(nb_model)
##           Length Class  Mode     
## apriori    2     table  numeric  
## tables    22     -none- list     
## levels     2     -none- character
## isnumeric 22     -none- logical  
## call       4     -none- call
# Make Predictions
#nb_testpred<-predict(nb_model,test_data,type='raw')
#nb_testclass<-unlist(apply(round(nb_testpred),1,which.max))-1
#nb_table<-table(test_data$readmitted, nb_testclass)
#base_metric_nb<-caret::confusionMatrix(nb_table)
#base_metric_nb
#base_metric_nb_table_standalone<-estimate_model_performance(test_data$readmitted,nb_testclass,'NB')
#base_metric_nb_table_standalone
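
A working version of the commented-out evaluation above might look like this sketch; predict.naiveBayes() returns class labels with type = "class":

nb_pred <- predict(nb_model, newdata = test_data, type = "class")
models_test_evaluation <- rbind(models_test_evaluation,
  estimate_model_performance(test_data$readmitted, nb_pred, "Naive Bayes"))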

Model Evaluation Summary

#models_test_evaluation %>%
#  filter(Algo %in% c("LM", "Random Forest", "Naive Bayes"))

The table above shows our models' performance. We evaluated the models using the criteria below:

  1. R^2

Overall, except for the MARS model (R^2 = 0.27), the R-squared values are within the range of 0.50 to 0.69 for all the tree-based and rule-based models.

Recall that the R-squared of the linear model is 0.42 (multiple R-squared) and 0.4081 (adjusted R-squared); the lower R^2 of MARS indicates that it is inferior to the linear model.

The remaining five models show an improvement in R-squared over the linear model. The improvements are most robust in the random forest model (R-squared 0.69, a roughly 50% improvement over the linear model) and the Cubist model (R-squared 0.676, also roughly a 50% improvement). The KNN, SVM, and bagged tree models have R-squared values around 0.53, not a significant improvement over the linear model.

  2. Root Mean Squared Error (RMSE)

RMSE is interpreted as how far, on average, the residuals are from zero.

The RMSE is lowest for the Cubist model (RMSE = 0.10) and the random forest model (RMSE = 0.101). MARS has the worst performance in terms of RMSE (0.15). The remaining three tree-based models (KNN, SVM, bagged tree) have similar RMSEs of about 0.12.

  3. Mean Absolute Error (MAE)

The MAE values follow exactly the same pattern as RMSE. The best performers are the Cubist and random forest models, while the worst performer is MARS. The remaining three models perform similarly.

Based on the table above, the Cubist model gives the best performance among the models. So we select Cubist as the champion model, predict values on the evaluation data set, and export them to an Excel file.

Taking RMSE, R-squared, and MAE all into consideration, Cubist is our best model. The random forest model follows Cubist very closely.

The linear model and MARS (multivariate adaptive regression splines) clearly do not have much advantage in predicting readmitted from these variables.

#final_results <- data.frame(rbind(base_metric_nb_table_standalone,base_metric_rf_table_standalone,base_metric_cubic_table_standalone,base_metric_knn_table_standalone,base_metric_mars_table_standalone))
#final_results