Image credit: rawpixel.
This analysis builds a cross-validated lasso regression model to predict Systolic blood pressure from several numeric and categorical factors: Age, Height, Weight, Gender, Hospital Location (CGH - County General Hospital, SMMC - St. Mary's Medical Center, VA - VA Hospital), Self-Assessed Health Status (Excellent, Fair, Good, Poor), and Smoker.
\[\begin{equation} \hat{Systolic} = \theta(Age) + \theta(Height) + \theta(\text{County General Hospital}) + \dots \\ + \theta(\text{VA Hospital}) + \theta(\text{HealthFair}) + \theta(Smoker) + b \end{equation}\]
Lasso regression is a powerful method for controlling a model's robustness: a more heavily regularized model copes with larger variations in out-of-sample data while maintaining greater accuracy. A tutorial video on lasso regression can be found here. Alternatively, the creator's own writings about lasso regression can be read in this book.
This is just one of the five machine learning modeling guides you can find here.
Load the patient dataset.
Such that:
\[\begin{equation} \color{red}{\hat{Systolic}} = \theta_1(Age) + \theta_2(Gender) + \theta_3(Height) + \theta_4(Weight) \\ + \theta_5(Smoker) + \theta_6(Location) + \theta_7(\text{SelfAssessedHealthStatus}) + b \end{equation}\]
In this project, we'll be using lasso regularization, adjusting the complexity of the model via the following cost function:
\[ J(\theta) \approx MSE + \lambda C \]
where \(\lambda\) is the amount of regularization and \(C\) is the model's complexity.
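Concretely, for lasso the complexity term \(C\) is the \(L_1\) norm of the coefficients, which is what drives weaker coefficients exactly to zero. A common textbook form of the objective (glmnet's internal scaling differs slightly) is:
\[ J(\theta) = \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p}\lvert \theta_j \rvert \]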
Identify the top two predictors.
Which \(\lambda\) will you use in your preferred model?
There are two ways to load the required packages.
#Option 1: install and load manually
#install.packages("pacman")
#library("pacman")
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(base64enc, ggplot2, kableExtra)
The dataset we will be loading appears in full in the table at the end of this post.
patients <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/2 linear regression/patients.csv", header = TRUE)
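If you keep patients.csv alongside your script, a relative path makes the code portable. A minimal sketch, assuming the file sits in your current working directory:
#Alternative: load from the working directory (assumes patients.csv is there)
patients <- read.csv("patients.csv", header = TRUE)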
Examine the data structure.
#Preview structure
str(patients)
## 'data.frame': 100 obs. of 10 variables:
## $ Age : int 38 43 38 40 49 46 33 40 28 31 ...
## $ Diastolic : int 93 77 83 75 80 70 88 82 78 86 ...
## $ Gender : Factor w/ 2 levels "'Female'","'Male'": 2 2 1 1 1 1 1 2 2 1 ...
## $ Height : int 71 69 64 67 64 68 64 68 68 66 ...
## $ LastName : Factor w/ 100 levels "'Adams'","'Alexander'",..: 84 45 96 46 11 22 55 97 57 86 ...
## $ Location : Factor w/ 3 levels "'County General Hospital'",..: 1 3 2 3 1 2 3 3 2 1 ...
## $ SelfAssessedHealthStatus: Factor w/ 4 levels "'Excellent'",..: 1 2 3 2 3 3 3 3 1 1 ...
## $ Smoker : int 1 0 0 0 0 0 1 0 0 0 ...
## $ Systolic : int 124 109 125 117 122 121 130 115 115 118 ...
## $ Weight : int 176 163 131 133 119 142 142 180 183 132 ...
Examine the top 5 rows.
#Preview top 5 rows
head(patients, n=5)
## Age Diastolic Gender Height LastName Location
## 1 38 93 'Male' 71 'Smith' 'County General Hospital'
## 2 43 77 'Male' 69 'Johnson' 'VA Hospital'
## 3 38 83 'Female' 64 'Williams' 'St. Mary's Medical Center'
## 4 40 75 'Female' 67 'Jones' 'VA Hospital'
## 5 49 80 'Female' 64 'Brown' 'County General Hospital'
## SelfAssessedHealthStatus Smoker Systolic Weight
## 1 'Excellent' 1 124 176
## 2 'Fair' 0 109 163
## 3 'Good' 0 125 131
## 4 'Fair' 0 117 133
## 5 'Good' 0 122 119
Each attribute contains 100 observations; there are no missing values. Therefore, we do not need to fill missing values with mean/mode, or drop any columns/rows.
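We can verify that claim directly with a quick check (not part of the original pipeline):
#Confirm there are no missing values
anyNA(patients)          #expect FALSE
colSums(is.na(patients)) #expect all zeros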
Now remove the unwanted Diastolic and LastName columns from the patients table.
patientsOriginal <- patients
df <- patients[-c(2, 5)] #deletes columns 2 and 5
Split the dataframe into categorical and numeric subsets.
df_categorical <- df[,c("Gender","Location","SelfAssessedHealthStatus", "Smoker")]
df_numeric <- df[,c("Age","Height", "Systolic", "Weight")]
One-hot encode categorical columns.
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dummies)
#Convert Gender, Location, and SelfAssessedHealthStatus to one-hot
#sep adds a separator in the new header titles
df_categorical <- dummy.data.frame(df_categorical, names=c("Location", "SelfAssessedHealthStatus", "Gender"), sep="-")
df_categorical <- df_categorical[-c(1)] #drop the Female column; Male alone encodes gender (Male = 1, Female = 0)
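The dummies package has since been archived on CRAN, so p_load may fail to install it. A base-R sketch with model.matrix produces equivalent dummy columns (it automatically drops the first level of each factor, so Female is already omitted; column names will differ slightly):
#Alternative one-hot encoding in base R
mm <- model.matrix(~ Gender + Location + SelfAssessedHealthStatus, data = patients)[, -1]
df_categorical_alt <- data.frame(mm, Smoker = patients$Smoker, check.names = FALSE)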
Standardize numerics, and then recombine ordered dataframes.
An additional interesting discussion on when to standardize is here.
scaled_numericdf <- scale(df_numeric)
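After scaling, each numeric column should have mean 0 and standard deviation 1; a quick sanity check:
#Means should be ~0 and standard deviations 1 after scaling
round(colMeans(scaled_numericdf), 10)
apply(scaled_numericdf, 2, sd)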
Recombine the forked categorical and numeric dataframes together using column bind.
df <- cbind(scaled_numericdf, df_categorical)
Plot histograms of the numeric columns. For bin specification, see here.
For plotting multiples, see here, and consider cowplot.
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)
AgePlot <- ggplot(data=df, aes(x=Age)) +
geom_histogram(bins = 20,
col="black",
fill="blue",
alpha = .3)
HeightPlot <- ggplot(data=df, aes(x=Height)) +
geom_histogram(bins = 20,
col="black",
fill="blue",
alpha = .3)
WeightPlot <- ggplot(data=df, aes(x=Weight)) +
geom_histogram(bins = 20,
col="black",
fill="blue",
alpha = .3)
SystolicPlot <- ggplot(data=df, aes(x=Systolic)) +
geom_histogram(bins = 20,
col="black",
fill="green",
alpha = .3)
grid.arrange(AgePlot, HeightPlot, WeightPlot, SystolicPlot, nrow=2)
Rename column headers for easier interpretation and reference.
names(df)[5] <- "Male"
names(df)[6] <- "County General Hospital"
names(df)[7] <- "St.Mary's Medical Center"
names(df)[8] <- "VA Hospital"
names(df)[9] <- "HealthExcellent"
names(df)[10] <- "HealthFair"
names(df)[11] <- "HealthGood"
names(df)[12] <- "HealthPoor"
names(df)[13] <- "Smoker"
Scatterplot visualization is an important step of the statistical analysis process, as descriptive statistics can oversimplify your understanding. For further understanding of this, please refer to Anscombe’s quartet.
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)
AgePlot <- ggplot(data=df, aes(x=Age, y=Systolic)) +
geom_point(size=0.5) +
geom_smooth(method=lm)
HeightPlot <- ggplot(data=df, aes(x=Height, y=Systolic)) +
geom_point(size=0.5) +
geom_smooth(method=lm)
WeightPlot <- ggplot(data=df, aes(x=Weight, y=Systolic)) +
geom_point(size=0.5) +
geom_smooth(method=lm)
grid.arrange(AgePlot, HeightPlot, WeightPlot, nrow=2)
Having completed the pre-processing and data exploration phases, we now move on to building a lasso regression model. Stack Overflow users recommend glmnet over lars as the preferred and more actively maintained lasso regression package.
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(glmnet)
The glmnet function runs on a matrix, not a dataframe, so df must first be converted.
#There are several methods available, but I prefer the classic.
#https://stackoverflow.com/questions/16518428/right-way-to-convert-data-frame-to-a-numeric-matrix-when-df-also-contains-strin
#df_matrix <- data.matrix(df)
df_matrix <- as.matrix(df)
#And split into x and y
X <- df_matrix[,c(1,2,4,5,6,7,8,9,10,11,12,13)]
Y <- df_matrix[,c(3)]
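Selecting columns by position is brittle if the column order ever changes; a name-based sketch performs the same split:
#Alternative: split by name rather than position
X <- df_matrix[, setdiff(colnames(df_matrix), "Systolic")]
Y <- df_matrix[, "Systolic"]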
Now it's time to fit the regression model.
fit <- glmnet(X, Y)
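Printing the fit summarizes each step of the \(\lambda\) path: the number of nonzero coefficients (Df), the percent deviance explained, and the \(\lambda\) value itself.
#Inspect the regularization path
print(fit)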
We can visualize the coefficients with the following plot.
plot(fit, label = TRUE)
Each curve is a factor, such as Weight or Height. Interpreting this plot, we can see that as regularization \(\lambda \uparrow\) increases (moving leftward on the x-axis), coefficients shrink to zero and drop out of the model. The coefficient value (y-axis) represents that factor's influence on Systolic at that level of regularization: a positive coefficient increases predicted Systolic, while a negative coefficient decreases it.
We can apply 10-fold cross validation, which randomly partitions the data into 10 folds, each serving once as the test set. When we plot regularization's influence on MSE (mean squared error), we should expect a positive relationship. Note: for reproducibility, we'll set the random seed to 123.
set.seed(123)
cvfit = cv.glmnet(X, Y)
plot(cvfit)
As expected, as \(\lambda \uparrow\), MSE\(\uparrow\). The error bars show one standard error above and below the mean cross-validated MSE, and the two vertical dotted lines mark the two candidate \(\lambda\) values discussed below; it is recommended that your selected \(\lambda\) fall within their range.
In seeking to minimize cross-validated error, we would use the following function:
cvfit$lambda.min
where \(\lambda\) = 0.0431221 is the value that minimizes the cross-validated error. Alternatively, we could use the following:
cvfit$lambda.1se
where \(\lambda\) = 0.2771919 is the largest \(\lambda\) at which the MSE is within one standard error of the minimal MSE.
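For comparison, the coefficients of this sparser lambda.1se model can be extracted the same way:
#Coefficients at the more heavily regularized lambda.1se
coef(cvfit, s = "lambda.1se")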
We can now print the coefficient \(\theta\) values from our model.
MinCVErrorModel <- coef(cvfit, s = "lambda.min")
print(MinCVErrorModel)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.39871736
## Age 0.04952344
## Height 0.04157106
## Weight .
## Male .
## County General Hospital 0.03032041
## St.Mary's Medical Center .
## VA Hospital -0.09250336
## HealthExcellent .
## HealthFair -0.28799287
## HealthGood .
## HealthPoor .
## Smoker 1.36563993
For easier access, let’s produce a table of the coefficient names and their values.
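A minimal sketch (assuming knitr, a dependency of kableExtra, is available): coerce the sparse matrix to a data frame and keep only the nonzero rows.
#Tabulate the nonzero coefficients from the sparse matrix
coefTable <- data.frame(Factor = rownames(MinCVErrorModel),
                        Theta = as.numeric(as.matrix(MinCVErrorModel)))
coefTable <- coefTable[coefTable$Theta != 0, ]
knitr::kable(coefTable, digits = 2)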
From the above, we see the factors with the greatest impact on Systolic are Smoker and HealthFair. Because Systolic was standardized, these coefficients are in standard-deviation units: being a smoker increases \(\hat{Systolic}\) by 1.37, while fair self-assessed health decreases it by 0.29.
Our final minimized cross-validated error model is:
\[\begin{equation} \hat{Systolic} = 0.05(Age) + 0.04(Height) + 0.03(\text{County General Hospital}) \\ - 0.09(\text{VA Hospital}) - 0.29(\text{HealthFair}) + 1.37(Smoker) - 0.40 \end{equation}\]
Of note, the factors Weight, Gender, the St. Mary's location, HealthExcellent, HealthGood, and HealthPoor do not contribute to this final model.
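To use the model, a brief sketch: predict (standardized) Systolic for the first five patients at the chosen \(\lambda\).
#Predict standardized Systolic for the first five rows
predict(cvfit, newx = X[1:5, ], s = "lambda.min")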
Linear regression was used to construct a model for predicting Systolic blood pressure from 12 numeric and one-hot-encoded categorical factors. A robust, 10-fold cross-validated model was then built using lasso regularization, with \(\lambda\) set to minimize cross-validated error (\(\lambda\) = 0.0431221). The top two predictors in this regularized model, which retained six factors, were Smoker status and HealthFair.
Thank you for reading, and happy regressing!
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(base64enc, ggplot2, kableExtra)
patients <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/2 linear regression/patients.csv", header = TRUE)
#Preview structure
str(patients)
#Preview top 5 rows
head(patients, n=5)
patientsOriginal <- patients
df <- patients[-c(2, 5)] #deletes columns 2 and 5
df_categorical <- df[,c("Gender","Location","SelfAssessedHealthStatus", "Smoker")]
df_numeric <- df[,c("Age","Height", "Systolic", "Weight")]
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dummies)
#Convert Gender, Location, and SelfAssessedHealthStatus to one-hot
#sep adds a separator in the new header titles
df_categorical <- dummy.data.frame(df_categorical, names=c("Location", "SelfAssessedHealthStatus", "Gender"), sep="-")
df_categorical <- df_categorical[-c(1)] #drop the Female column; Male alone encodes gender (Male = 1, Female = 0)
scaled_numericdf <- scale(df_numeric)
df <- cbind(scaled_numericdf, df_categorical)
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)
AgePlot <- ggplot(data=df, aes(x=Age)) +
geom_histogram(bins = 20,
col="black",
fill="blue",
alpha = .3)
HeightPlot <- ggplot(data=df, aes(x=Height)) +
geom_histogram(bins = 20,
col="black",
fill="blue",
alpha = .3)
WeightPlot <- ggplot(data=df, aes(x=Weight)) +
geom_histogram(bins = 20,
col="black",
fill="blue",
alpha = .3)
SystolicPlot <- ggplot(data=df, aes(x=Systolic)) +
geom_histogram(bins = 20,
col="black",
fill="green",
alpha = .3)
grid.arrange(AgePlot, HeightPlot, WeightPlot, SystolicPlot, nrow=2)
names(df)[5] <- "Male"
names(df)[6] <- "County General Hospital"
names(df)[7] <- "St.Mary's Medical Center"
names(df)[8] <- "VA Hospital"
names(df)[9] <- "HealthExcellent"
names(df)[10] <- "HealthFair"
names(df)[11] <- "HealthGood"
names(df)[12] <- "HealthPoor"
names(df)[13] <- "Smoker"
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)
AgePlot <- ggplot(data=df, aes(x=Age, y=Systolic)) +
geom_point(size=0.5) +
geom_smooth(method=lm)
HeightPlot <- ggplot(data=df, aes(x=Height, y=Systolic)) +
geom_point(size=0.5) +
geom_smooth(method=lm)
WeightPlot <- ggplot(data=df, aes(x=Weight, y=Systolic)) +
geom_point(size=0.5) +
geom_smooth(method=lm)
grid.arrange(AgePlot, HeightPlot, WeightPlot, nrow=2)
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(glmnet)
#There are several methods available, but I prefer the classic.
#https://stackoverflow.com/questions/16518428/right-way-to-convert-data-frame-to-a-numeric-matrix-when-df-also-contains-strin
#df_matrix <- data.matrix(df)
df_matrix <- as.matrix(df)
#And split into x and y
X <- df_matrix[,c(1,2,4,5,6,7,8,9,10,11,12,13)]
Y <- df_matrix[,c(3)]
fit <- glmnet(X, Y)
plot(fit, label = TRUE)
set.seed(123)
cvfit = cv.glmnet(X, Y)
plot(cvfit)
cvfit$lambda.min
cvfit$lambda.1se
MinCVErrorModel <- coef(cvfit, s = "lambda.min")
print(MinCVErrorModel)
%Clear previous variables, wipe screen, close windows
clear all
clc
% close all
% https://www.mathworks.com/help/matlab/matlab_prog/create-a-table.html
load patients;
%Target-----------------------------------------------------------
Y = Systolic;
%Standardize numericals--------------------------------------------
XNumeric = [Age Height Weight];
XNumeric_scaled = zscore(XNumeric);
%One-hot encode categoricals---------------------------------------
%Because these attributes are single-columns containing many values, we
%need to break them into binary attributes, one for each value.
Gender = nominal(Gender);
GenderCateg = dummyvar(Gender);
Location = nominal(Location);
LocationCateg = dummyvar(Location);
SelfAssessedHealthStatus = nominal(SelfAssessedHealthStatus);
SelfAssessedHealthStatusCateg = dummyvar(SelfAssessedHealthStatus);
% Bring Categorical together
%Now that we've broken each attribute value into a separate binary vector,
%we need to bring them all back together into a single matrix.
XCateg = [GenderCateg LocationCateg SelfAssessedHealthStatusCateg Smoker];
% Merge numerical with categorical matrices------------------------
XAll = [XNumeric_scaled XCateg];
%Lasso===========================================================
%We need to determine the number of k-folds and alpha value.
%With those values set, we can run our lasso linear regression.
%[B, FitInfo] = lasso(X, Y, Name, Value)
% Set cross validation k-fold, k
kfold = 10;
% 'Alpha', alpha value, where alpha = 1 is lasso, and = 0.00001 approaches
% ridge
alpha = 1;
% Don't set lambda. It's a vector, not a scalar.
% Default lambda count (steps) = 100
[B, FitInfo] = lasso(XAll, Y, 'CV', kfold, 'Alpha', alpha);
% Lasso Plot of Coefficients=====================================
lassoPlot(B, FitInfo, 'PredictorNames', {'Age', 'Height', 'Weight', ...
    'Female', 'Male', ...
    'County General Hospital', 'St Marys Medical Center', 'VA Hospital', ...
    'Excellent', 'Fair', 'Good', 'Poor', ...
    'Smoker'}, ...
    'PlotType', 'lambda', ...
    'XScale', 'log')
ylabel('theta')
xlabel('lambda')
% Cross-validated Deviance of Lasso Plot==========================
%Produce an extra graph to display cross-validation
lassoPlot(B, FitInfo, 'PlotType', 'CV');
% Theta vs Predictors=============================================
%Produce an extra graph to display theta vs predictors
figure, pcolor(B), xlabel('Theta'), ylabel('Predictors')
% Interpretation---------------------------------------------------
% Identify the number of nonzero coefficients at minimum deviance
% plus one standard error.
indx = FitInfo.Index1SE;
B0 = B(:,indx);
nonzeros = sum(B0 ~= 0)
# coding: utf-8
In[68]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import linear_model
from sklearn.linear_model import Lasso
from sklearn import preprocessing
In[69]:
Load data. Because there is a header row, set header=0.
patients = pd.read_csv("C:/tmp/patients.csv", header=0)
Back up patients, just in case we need it later (.copy() makes a true copy rather than a reference to the same dataframe).
patientsBackup = patients.copy()
First, split the matrix into y (dependent) and X (independent). Remember, Python is 0-offset! The "3rd" entry is at position 2.
patientsY = Systolic
patientsX = everything else, excluding LastName and Systolic
In[61]:
split dependent variable and independent variables
patientsY = patients["Systolic"]
patientsX = patients[["Age", "Gender", "Height", "Location", "SelfAssessedHealthStatus", "Weight"]]
patientsXNumeric = patients[["Age", "Height", "Weight"]]
Smoker is not pulled with the other categorical data, so this next code line was added.
patientsXBinary = patients[["Smoker"]]
In[62]:
Standardize
patientsXNumeric = patients[["Age", "Height", "Weight"]]
patientsXNumeric_scaled = preprocessing.scale(patientsXNumeric, axis=0)
Standardizing returned a NumPy array, which dropped the column names. I need to fix this.
In[63]:
Return only object datatypes (non-numeric here)
categoriesX = patientsX.select_dtypes(include=[object]).copy()
categoriesX.head()
One-Hot Encoding
As said in this terrific one-hot tutorial:
> There are many libraries out there that support one-hot encoding, but the simplest one is using pandas' .get_dummies() method. There are mainly three arguments important here: the first one is the DataFrame you want to encode on, second being the columns argument which lets you specify the columns you want to do encoding on, and third, the prefix argument which lets you specify the prefix for the new columns that will be created after encoding.
LastName is not to be included in the linear regression.
In[34]:
categoriesX_onehot = categoriesX.copy()
categoriesX_onehot = pd.get_dummies(categoriesX, columns=["Gender", "Location", "SelfAssessedHealthStatus"], prefix=["Gender", "Location", "SelfAssessedHealthStatus"])
categoriesXBinary_onehot = pd.get_dummies(patientsXBinary, columns=["Smoker"], prefix=["Smoker"])
Return results
print(categoriesX_onehot.head())
print(categoriesXBinary_onehot.head())
Now that one-hot encoding has split the categorical attributes into many dummy attributes, they must be concatenated back together. This can be done via pandas' .concat() method. The axis argument is set to 1 as you want to merge on columns.
In[35]:
print("categoriesX_onehot is:", type(categoriesX_onehot))
print(categoriesX_onehot.shape)
print("categoriesXBinary_onehot is:", type(categoriesXBinary_onehot))
print(categoriesXBinary_onehot.shape)
print("patientsXNumeric_scaled is:", type(patientsXNumeric_scaled))
print(patientsXNumeric_scaled.shape)
patientsXNumeric_scaled is an array. I used https://stackoverflow.com/questions/20763012/creating-a-pandas-dataframe-from-a-numpy-array-how-do-i-specify-the-index-colum to convert it to a dataframe.
In[36]:
patientsXNumeric_scaleddf = pd.DataFrame(patientsXNumeric_scaled)
In[37]:
print("patientsXNumeric_scaleddf is:", type(patientsXNumeric_scaleddf))
In[38]:
print(patientsXNumeric_scaleddf.head())
Looks better, but it still needs column names.
In[39]:
patientsXNumeric_scaleddf.columns = ["Age", "Height", "Weight"]
Now we bring all the columns back together as one dataframe.
In[40]:
Bring them back together
#The first concat used the unscaled numerics and is superseded by the scaled version below
#patientsXAll = pd.concat([categoriesX_onehot, patientsXNumeric, categoriesXBinary_onehot], axis=1)
patientsXAll = pd.concat([categoriesX_onehot, patientsXNumeric_scaleddf, categoriesXBinary_onehot], axis=1)
top 3 rows
print(patientsXAll.head(3))
lasso = linear_model.Lasso()
print("Cross Val Score:" + str(cross_val_score(lasso, patientsXAll, patientsY, cv=10)))
To view this entire document’s markdown code, click here.
If you don't have the dataset, copy the table below, paste it into Excel, and save it as a comma-separated file named patients.csv in your preferred directory.
| Age | Diastolic | Gender | Height | LastName | Location | SelfAssessedHealthStatus | Smoker | Systolic | Weight |
|---|---|---|---|---|---|---|---|---|---|
| 38 | 93 | ‘Male’ | 71 | ‘Smith’ | ‘County General Hospital’ | ‘Excellent’ | 1 | 124 | 176 |
| 43 | 77 | ‘Male’ | 69 | ‘Johnson’ | ‘VA Hospital’ | ‘Fair’ | 0 | 109 | 163 |
| 38 | 83 | ‘Female’ | 64 | ‘Williams’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 125 | 131 |
| 40 | 75 | ‘Female’ | 67 | ‘Jones’ | ‘VA Hospital’ | ‘Fair’ | 0 | 117 | 133 |
| 49 | 80 | ‘Female’ | 64 | ‘Brown’ | ‘County General Hospital’ | ‘Good’ | 0 | 122 | 119 |
| 46 | 70 | ‘Female’ | 68 | ‘Davis’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 121 | 142 |
| 33 | 88 | ‘Female’ | 64 | ‘Miller’ | ‘VA Hospital’ | ‘Good’ | 1 | 130 | 142 |
| 40 | 82 | ‘Male’ | 68 | ‘Wilson’ | ‘VA Hospital’ | ‘Good’ | 0 | 115 | 180 |
| 28 | 78 | ‘Male’ | 68 | ‘Moore’ | ‘St. Mary’s Medical Center’ | ‘Excellent’ | 0 | 115 | 183 |
| 31 | 86 | ‘Female’ | 66 | ‘Taylor’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 118 | 132 |
| 45 | 77 | ‘Female’ | 68 | ‘Anderson’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 114 | 128 |
| 42 | 68 | ‘Female’ | 66 | ‘Thomas’ | ‘St. Mary’s Medical Center’ | ‘Poor’ | 0 | 115 | 137 |
| 25 | 74 | ‘Male’ | 71 | ‘Jackson’ | ‘VA Hospital’ | ‘Poor’ | 0 | 127 | 174 |
| 39 | 95 | ‘Male’ | 72 | ‘White’ | ‘VA Hospital’ | ‘Excellent’ | 1 | 130 | 202 |
| 36 | 79 | ‘Female’ | 65 | ‘Harris’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 114 | 129 |
| 48 | 92 | ‘Male’ | 71 | ‘Martin’ | ‘VA Hospital’ | ‘Good’ | 1 | 130 | 181 |
| 32 | 95 | ‘Male’ | 69 | ‘Thompson’ | ‘St. Mary’s Medical Center’ | ‘Excellent’ | 1 | 124 | 191 |
| 27 | 79 | ‘Female’ | 69 | ‘Garcia’ | ‘VA Hospital’ | ‘Fair’ | 1 | 123 | 131 |
| 37 | 77 | ‘Male’ | 70 | ‘Martinez’ | ‘County General Hospital’ | ‘Good’ | 0 | 119 | 179 |
| 50 | 76 | ‘Male’ | 68 | ‘Robinson’ | ‘County General Hospital’ | ‘Good’ | 0 | 125 | 172 |
| 48 | 75 | ‘Female’ | 65 | ‘Clark’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 121 | 133 |
| 39 | 79 | ‘Female’ | 64 | ‘Rodriguez’ | ‘VA Hospital’ | ‘Fair’ | 0 | 123 | 117 |
| 41 | 88 | ‘Female’ | 62 | ‘Lewis’ | ‘VA Hospital’ | ‘Fair’ | 0 | 114 | 137 |
| 44 | 90 | ‘Female’ | 66 | ‘Lee’ | ‘County General Hospital’ | ‘Fair’ | 1 | 128 | 146 |
| 28 | 96 | ‘Female’ | 65 | ‘Walker’ | ‘County General Hospital’ | ‘Good’ | 1 | 129 | 123 |
| 25 | 77 | ‘Male’ | 70 | ‘Hall’ | ‘VA Hospital’ | ‘Poor’ | 0 | 114 | 189 |
| 39 | 80 | ‘Female’ | 63 | ‘Allen’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 113 | 143 |
| 25 | 76 | ‘Female’ | 63 | ‘Young’ | ‘County General Hospital’ | ‘Good’ | 0 | 125 | 114 |
| 36 | 83 | ‘Male’ | 68 | ‘Hernandez’ | ‘County General Hospital’ | ‘Poor’ | 0 | 120 | 166 |
| 30 | 89 | ‘Male’ | 67 | ‘King’ | ‘County General Hospital’ | ‘Excellent’ | 1 | 127 | 186 |
| 45 | 92 | ‘Female’ | 70 | ‘Wright’ | ‘VA Hospital’ | ‘Excellent’ | 1 | 134 | 126 |
| 40 | 83 | ‘Female’ | 66 | ‘Lopez’ | ‘VA Hospital’ | ‘Poor’ | 0 | 121 | 137 |
| 25 | 80 | ‘Female’ | 64 | ‘Hill’ | ‘St. Mary’s Medical Center’ | ‘Excellent’ | 0 | 115 | 138 |
| 47 | 84 | ‘Male’ | 70 | ‘Scott’ | ‘St. Mary’s Medical Center’ | ‘Excellent’ | 0 | 127 | 187 |
| 44 | 92 | ‘Male’ | 71 | ‘Green’ | ‘County General Hospital’ | ‘Good’ | 0 | 121 | 193 |
| 48 | 83 | ‘Female’ | 66 | ‘Adams’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 127 | 137 |
| 44 | 90 | ‘Male’ | 71 | ‘Baker’ | ‘VA Hospital’ | ‘Good’ | 1 | 136 | 192 |
| 35 | 85 | ‘Female’ | 66 | ‘Gonzalez’ | ‘St. Mary’s Medical Center’ | ‘Fair’ | 0 | 117 | 118 |
| 33 | 90 | ‘Male’ | 66 | ‘Nelson’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 1 | 124 | 180 |
| 38 | 74 | ‘Female’ | 63 | ‘Carter’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 120 | 128 |
| 39 | 92 | ‘Male’ | 71 | ‘Mitchell’ | ‘County General Hospital’ | ‘Fair’ | 1 | 128 | 164 |
| 44 | 80 | ‘Male’ | 69 | ‘Perez’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 116 | 183 |
| 44 | 89 | ‘Male’ | 70 | ‘Roberts’ | ‘VA Hospital’ | ‘Good’ | 1 | 132 | 169 |
| 37 | 96 | ‘Male’ | 70 | ‘Turner’ | ‘VA Hospital’ | ‘Excellent’ | 1 | 137 | 194 |
| 45 | 89 | ‘Male’ | 67 | ‘Phillips’ | ‘VA Hospital’ | ‘Good’ | 0 | 117 | 172 |
| 37 | 77 | ‘Female’ | 65 | ‘Campbell’ | ‘County General Hospital’ | ‘Fair’ | 0 | 116 | 135 |
| 30 | 81 | ‘Male’ | 68 | ‘Parker’ | ‘VA Hospital’ | ‘Poor’ | 0 | 119 | 182 |
| 39 | 76 | ‘Female’ | 62 | ‘Evans’ | ‘County General Hospital’ | ‘Good’ | 0 | 123 | 121 |
| 42 | 83 | ‘Male’ | 70 | ‘Edwards’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 116 | 158 |
| 42 | 78 | ‘Male’ | 67 | ‘Collins’ | ‘County General Hospital’ | ‘Good’ | 1 | 124 | 179 |
| 49 | 95 | ‘Male’ | 68 | ‘Stewart’ | ‘County General Hospital’ | ‘Poor’ | 1 | 129 | 170 |
| 44 | 91 | ‘Female’ | 62 | ‘Sanchez’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 1 | 130 | 136 |
| 43 | 91 | ‘Female’ | 64 | ‘Morris’ | ‘County General Hospital’ | ‘Poor’ | 1 | 132 | 135 |
| 47 | 86 | ‘Female’ | 66 | ‘Rogers’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 117 | 147 |
| 50 | 89 | ‘Male’ | 72 | ‘Reed’ | ‘VA Hospital’ | ‘Excellent’ | 1 | 129 | 186 |
| 38 | 79 | ‘Female’ | 63 | ‘Cook’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 118 | 124 |
| 41 | 74 | ‘Female’ | 66 | ‘Morgan’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 120 | 134 |
| 45 | 82 | ‘Male’ | 70 | ‘Bell’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 1 | 138 | 170 |
| 36 | 76 | ‘Male’ | 71 | ‘Murphy’ | ‘VA Hospital’ | ‘Good’ | 0 | 117 | 180 |
| 38 | 81 | ‘Female’ | 68 | ‘Bailey’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 113 | 130 |
| 29 | 77 | ‘Female’ | 63 | ‘Rivera’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 122 | 130 |
| 28 | 73 | ‘Female’ | 65 | ‘Cooper’ | ‘VA Hospital’ | ‘Good’ | 0 | 115 | 127 |
| 30 | 85 | ‘Female’ | 67 | ‘Richardson’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 120 | 141 |
| 28 | 76 | ‘Female’ | 66 | ‘Cox’ | ‘County General Hospital’ | ‘Good’ | 0 | 117 | 111 |
| 29 | 80 | ‘Female’ | 68 | ‘Howard’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 123 | 134 |
| 36 | 80 | ‘Male’ | 71 | ‘Ward’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 123 | 189 |
| 45 | 79 | ‘Female’ | 70 | ‘Torres’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 119 | 137 |
| 32 | 82 | ‘Female’ | 60 | ‘Peterson’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 110 | 136 |
| 31 | 79 | ‘Female’ | 64 | ‘Gray’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 121 | 130 |
| 48 | 82 | ‘Female’ | 64 | ‘Ramirez’ | ‘County General Hospital’ | ‘Excellent’ | 1 | 138 | 137 |
| 25 | 75 | ‘Male’ | 66 | ‘James’ | ‘County General Hospital’ | ‘Good’ | 0 | 125 | 186 |
| 40 | 91 | ‘Female’ | 64 | ‘Watson’ | ‘VA Hospital’ | ‘Fair’ | 1 | 122 | 127 |
| 39 | 74 | ‘Male’ | 72 | ‘Brooks’ | ‘St. Mary’s Medical Center’ | ‘Excellent’ | 0 | 120 | 176 |
| 41 | 78 | ‘Female’ | 65 | ‘Kelly’ | ‘St. Mary’s Medical Center’ | ‘Poor’ | 0 | 117 | 127 |
| 33 | 85 | ‘Female’ | 67 | ‘Sanders’ | ‘St. Mary’s Medical Center’ | ‘Excellent’ | 1 | 125 | 115 |
| 31 | 84 | ‘Male’ | 72 | ‘Price’ | ‘VA Hospital’ | ‘Fair’ | 1 | 124 | 178 |
| 35 | 75 | ‘Female’ | 64 | ‘Bennett’ | ‘County General Hospital’ | ‘Fair’ | 0 | 121 | 131 |
| 32 | 78 | ‘Male’ | 68 | ‘Wood’ | ‘St. Mary’s Medical Center’ | ‘Poor’ | 0 | 118 | 183 |
| 42 | 81 | ‘Male’ | 66 | ‘Barnes’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 120 | 194 |
| 48 | 79 | ‘Female’ | 64 | ‘Ross’ | ‘VA Hospital’ | ‘Good’ | 0 | 118 | 126 |
| 34 | 85 | ‘Male’ | 68 | ‘Henderson’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 118 | 186 |
| 39 | 79 | ‘Male’ | 69 | ‘Coleman’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 122 | 188 |
| 28 | 82 | ‘Male’ | 69 | ‘Jenkins’ | ‘County General Hospital’ | ‘Good’ | 1 | 134 | 189 |
| 29 | 80 | ‘Female’ | 64 | ‘Perry’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 131 | 120 |
| 32 | 80 | ‘Female’ | 63 | ‘Powell’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 113 | 132 |
| 39 | 92 | ‘Male’ | 68 | ‘Long’ | ‘County General Hospital’ | ‘Good’ | 1 | 125 | 182 |
| 37 | 92 | ‘Female’ | 65 | ‘Patterson’ | ‘County General Hospital’ | ‘Poor’ | 1 | 135 | 120 |
| 49 | 96 | ‘Female’ | 63 | ‘Hughes’ | ‘County General Hospital’ | ‘Good’ | 1 | 128 | 123 |
| 31 | 87 | ‘Female’ | 66 | ‘Flores’ | ‘VA Hospital’ | ‘Good’ | 1 | 123 | 141 |
| 37 | 81 | ‘Female’ | 65 | ‘Washington’ | ‘St. Mary’s Medical Center’ | ‘Good’ | 0 | 122 | 129 |
| 38 | 90 | ‘Male’ | 68 | ‘Butler’ | ‘County General Hospital’ | ‘Excellent’ | 1 | 138 | 184 |
| 45 | 77 | ‘Male’ | 71 | ‘Simmons’ | ‘VA Hospital’ | ‘Excellent’ | 0 | 124 | 181 |
| 30 | 91 | ‘Female’ | 70 | ‘Foster’ | ‘St. Mary’s Medical Center’ | ‘Fair’ | 0 | 130 | 124 |
| 48 | 79 | ‘Male’ | 71 | ‘Gonzales’ | ‘County General Hospital’ | ‘Good’ | 0 | 123 | 174 |
| 48 | 73 | ‘Female’ | 66 | ‘Bryant’ | ‘County General Hospital’ | ‘Excellent’ | 0 | 129 | 134 |
| 25 | 99 | ‘Male’ | 69 | ‘Alexander’ | ‘County General Hospital’ | ‘Good’ | 1 | 128 | 171 |
| 44 | 92 | ‘Male’ | 69 | ‘Russell’ | ‘VA Hospital’ | ‘Good’ | 1 | 124 | 188 |
| 49 | 74 | ‘Male’ | 70 | ‘Griffin’ | ‘County General Hospital’ | ‘Fair’ | 0 | 119 | 186 |
| 45 | 93 | ‘Male’ | 68 | ‘Diaz’ | ‘County General Hospital’ | ‘Good’ | 1 | 136 | 172 |
| 48 | 86 | ‘Male’ | 66 | ‘Hayes’ | ‘County General Hospital’ | ‘Fair’ | 0 | 114 | 177 |
I’ve recorded a 45 minute video on how to bring machine learning to the next level in an applied Wine Quality Prediction Project.
If you're not ready for that and want a tutorial on the basics of machine learning, my 1.5 hour Overview of Machine Learning might be better. It will guide you through many of the general concepts, as well as some of the various models used across these guides.