Image credit: rawpixel.

This analysis builds a cross-validated lasso regression model to predict Systolic blood pressure, using various numeric and categorical factors: Age, Height, Weight, Gender, Hospital Location (CGH - County General Hospital, SMMC - St. Mary’s Medical Center, VA - VA Hospital), Status of Health (Excellent, Fair, Good, Poor), and Smoker.

\[\begin{equation} \hat{Systolic} = \theta(Age) + \theta(Height) + \theta(County General Hospital) + ... \\ + \theta(VA Hospital) + \theta(HealthFair) + \theta(Smoker) + b \\ \end{equation}\]

Lasso regression adds a penalty on the size of the coefficients to ordinary linear regression, which lets you control how strongly the model is regularized. A more heavily regularized model tends to hold its accuracy better on out-of-sample data, because it is less prone to overfitting the training set. A tutorial video on lasso regression can be found here. Alternatively, the creator’s own writings about lasso regression can be read in this book.

This is just one of the five machine learning modeling guides you can find here.

Overview

  1. Load the patient dataset.

  2. Use linear regression with the following variables:
  • Age
  • Gender
  • Height
  • Weight
  • Smoker
  • Location
  • SelfAssessedHealthStatus
  • Systolic blood pressure (Target)

Such that:

\[\begin{equation} \color{red}{\hat{Systolic}} = \theta_1(Age) + \theta_2(Gender) + \theta_3(Height) + \theta_4(Weight) ... \\ ...+ \theta_5(Smoker) + \theta_6(Location) + \theta_7(SelfAssessedHealthStatus) + b \\ \end{equation}\]
  3. Use lasso regression with 10-fold cross-validation to identify useful predictors.

In this project, we’ll be using lasso regularization, adjusting the complexity of the model using the following:

\[ J(\theta) \approx MSE + \lambda C \] Where \(\lambda\) is the amount of regularization and C is model complexity.

  4. Identify the top two predictors.

  5. Decide which \(\lambda\) you will use in your preferred model.
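
As a rough illustration of the cost \(J(\theta)\) from step 3, the sketch below takes C to be the sum of absolute coefficient values (the L1 norm), which is the usual lasso choice. The function names and toy numbers are hypothetical and are not taken from the patients dataset; glmnet scales the error term slightly differently, so treat this as a sketch of the idea rather than the package's exact objective.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def lasso_cost(y_true, y_pred, theta, lam):
    """MSE plus lambda times the L1 norm of the coefficients (intercept excluded)."""
    complexity = sum(abs(t) for t in theta)
    return mse(y_true, y_pred) + lam * complexity


# Hypothetical toy values, for illustration only
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.5]
theta = [0.5, -0.25, 0.0]

cost_no_penalty = lasso_cost(y_true, y_pred, theta, lam=0.0)
cost_penalized = lasso_cost(y_true, y_pred, theta, lam=0.2)
print(cost_no_penalty, cost_penalized)
```

With \(\lambda = 0\) the cost is just the MSE; raising \(\lambda\) makes every nonzero coefficient more expensive, which is what pushes weak predictors out of the model.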

Pre-Modeling

Load Required Packages

There are two ways to load the required packages.

  1. Install pacman using the following code.
# #install.packages("pacman")
# library("pacman")
  2. Or use the check below, which installs pacman only if it is missing. If it does not work for you, fall back to the code above.
#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(base64enc, ggplot2, kableExtra)

Load Data

Load the dataset from a CSV file. The full table is reproduced in the Data section at the end of this guide.

patients <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/2 linear regression/patients.csv", header = TRUE)

Preview Data

Examine the data structure.

#Preview structure
str(patients)
## 'data.frame':    100 obs. of  10 variables:
##  $ Age                     : int  38 43 38 40 49 46 33 40 28 31 ...
##  $ Diastolic               : int  93 77 83 75 80 70 88 82 78 86 ...
##  $ Gender                  : Factor w/ 2 levels "'Female'","'Male'": 2 2 1 1 1 1 1 2 2 1 ...
##  $ Height                  : int  71 69 64 67 64 68 64 68 68 66 ...
##  $ LastName                : Factor w/ 100 levels "'Adams'","'Alexander'",..: 84 45 96 46 11 22 55 97 57 86 ...
##  $ Location                : Factor w/ 3 levels "'County General Hospital'",..: 1 3 2 3 1 2 3 3 2 1 ...
##  $ SelfAssessedHealthStatus: Factor w/ 4 levels "'Excellent'",..: 1 2 3 2 3 3 3 3 1 1 ...
##  $ Smoker                  : int  1 0 0 0 0 0 1 0 0 0 ...
##  $ Systolic                : int  124 109 125 117 122 121 130 115 115 118 ...
##  $ Weight                  : int  176 163 131 133 119 142 142 180 183 132 ...

Examine the top 5 rows.

#Preview top 5 rows
head(patients, n=5)
##   Age Diastolic   Gender Height   LastName                    Location
## 1  38        93   'Male'     71    'Smith'   'County General Hospital'
## 2  43        77   'Male'     69  'Johnson'               'VA Hospital'
## 3  38        83 'Female'     64 'Williams' 'St. Mary's Medical Center'
## 4  40        75 'Female'     67    'Jones'               'VA Hospital'
## 5  49        80 'Female'     64    'Brown'   'County General Hospital'
##   SelfAssessedHealthStatus Smoker Systolic Weight
## 1              'Excellent'      1      124    176
## 2                   'Fair'      0      109    163
## 3                   'Good'      0      125    131
## 4                   'Fair'      0      117    133
## 5                   'Good'      0      122    119

Preprocessing

Each attribute contains 100 observations; there are no missing values. Therefore, we do not need to fill missing values with mean/mode, or drop any columns/rows.

Now remove the unwanted columns, Diastolic and LastName, from the patients table. A copy of the original table is kept first.

patientsOriginal <- patients

df <- patients[-c(2, 5)] #deletes columns 2 and 5

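Split the dataframe into categorical and numeric columns. Smoker is already a 0/1 indicator, so it travels with the categorical columns but needs no further encoding.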

df_categorical <- df[,c("Gender","Location","SelfAssessedHealthStatus", "Smoker")]
df_numeric <- df[,c("Age","Height", "Systolic", "Weight")]

One-hot encode categorical columns.

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dummies)

#One-hot encode Gender, Location, and SelfAssessedHealthStatus
#sep sets the separator between the column name and the level in each new header
df_categorical <- dummy.data.frame(df_categorical, names=c("Location", "SelfAssessedHealthStatus", "Gender"), sep="-")

df_categorical <- df_categorical[-c(1)] #drops the Female column, leaving Male as the indicator (Female = 0, Male = 1)
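
To make the encoding step concrete, here is a minimal pure-Python sketch of one-hot encoding. The helper `one_hot` is hypothetical and only mimics the shape of the result (one 0/1 column per level, named with a "-" separator); it is not how the dummies package is implemented.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 indicator column per distinct level."""
    levels = sorted(set(values))
    names = [f"{prefix}-{level}" for level in levels]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return names, rows


# Four illustrative Location values
locations = [
    "County General Hospital",
    "VA Hospital",
    "St. Mary's Medical Center",
    "VA Hospital",
]

names, rows = one_hot(locations, "Location")
for name in names:
    print(name)
for row in rows:
    print(row)
```

Each row contains exactly one 1, marking the level that the original value belonged to.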

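Standardize the numeric columns. Note that df_numeric includes the target, Systolic, so the model's coefficients will be expressed in standard deviations of Systolic rather than in raw blood pressure units.

An additional interesting discussion on when to standardize is here.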

scaled_numericdf <- scale(df_numeric)
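
For readers who want to see what standardizing does to a column, the sketch below computes z-scores by hand: subtract the mean, then divide by the sample standard deviation (the n - 1 version, which is what R's scale() uses by default). The helper name `standardize` is hypothetical, and the ages are the first five rows of the dataset.

```python
import statistics


def standardize(values):
    """Center values on their mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in values]


ages = [38, 43, 38, 40, 49]  # Age of the first five patients
scaled = standardize(ages)
print([round(z, 3) for z in scaled])
```

After standardizing, the column has mean 0 and standard deviation 1, so coefficients on different predictors become directly comparable.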

Recombine the forked categorical and numeric dataframes together using column bind.

df <- cbind(scaled_numericdf, df_categorical)

Data Exploration

Histograms

Plot histogram of numeric columns. For bin specification, see here.

For plotting multiples, see here, and consider cowplot.

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)

AgePlot <- ggplot(data=df, aes(x=Age)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3)

HeightPlot <- ggplot(data=df, aes(x=Height)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3)

WeightPlot <- ggplot(data=df, aes(x=Weight)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3)

SystolicPlot <- ggplot(data=df, aes(x=Systolic)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="green",
                  alpha = .3)

grid.arrange(AgePlot, HeightPlot, WeightPlot, SystolicPlot, nrow=2)

Rename column headers for easier interpretation and reference.

names(df)[5] <- "Male"
names(df)[6] <- "County General Hospital"
names(df)[7] <- "St.Mary's Medical Center"
names(df)[8] <- "VA Hospital"
names(df)[9] <- "HealthExcellent"
names(df)[10] <- "HealthFair"
names(df)[11] <- "HealthGood"
names(df)[12] <- "HealthPoor"
names(df)[13] <- "Smoker"

Scatterplots

Scatterplot visualization is an important step of the statistical analysis process, as descriptive statistics can oversimplify your understanding. For further understanding of this, please refer to Anscombe’s quartet.
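
The point about Anscombe’s quartet can be checked numerically. The sketch below uses the four published Anscombe datasets and computes the mean of y and the Pearson correlation for each; the helper `pearson` is written out by hand for illustration.

```python
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


results = {name: (statistics.mean(y), pearson(x, y)) for name, (x, y) in quartet.items()}
for name, (mean_y, r) in results.items():
    print(name, round(mean_y, 2), round(r, 2))
```

All four datasets share nearly identical summary statistics, yet their scatterplots look completely different, which is why plotting comes before modeling here.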

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)

AgePlot <- ggplot(data=df, aes(x=Age, y=Systolic)) + 
                    geom_point(size=0.5) +
                    geom_smooth(method=lm)

HeightPlot <- ggplot(data=df, aes(x=Height, y=Systolic)) + 
                    geom_point(size=0.5) +
                    geom_smooth(method=lm)

WeightPlot <- ggplot(data=df, aes(x=Weight, y=Systolic)) + 
                    geom_point(size=0.5) +
                    geom_smooth(method=lm)

grid.arrange(AgePlot, HeightPlot, WeightPlot, nrow=2)

Modeling

Having completed the pre-processing and data exploration phases, we now move on to building a lasso regression model. Stack Overflow users recommend glmnet over lars as the preferred and more actively maintained lasso regression package.

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(glmnet)

The glmnet function expects a numeric matrix, not a dataframe, so df must be converted first.

#There are several methods available, but I prefer the classic.
#https://stackoverflow.com/questions/16518428/right-way-to-convert-data-frame-to-a-numeric-matrix-when-df-also-contains-strin
#df_matrix <- data.matrix(df) 

df_matrix <- as.matrix(df)

#And split into x and y
X <- df_matrix[,c(1,2,4,5,6,7,8,9,10,11,12,13)]
Y <- df_matrix[,c(3)]

Now it’s time to fit the regression model. By default, glmnet fits a lasso penalty over a whole sequence of \(\lambda\) values.

fit <- glmnet(X, Y)

We can visualize the coefficients with the following plot.

plot(fit, label = TRUE)

Each curve is a factor, such as Weight or Height. Interpreting this plot, we can see that as regularization \(\lambda\) increases (leftward on the x-axis), coefficients shrink to zero and drop out of the model. The coefficient value (y-axis) represents that factor’s influence on Systolic at that level of regularization. A positive coefficient increases Systolic, while a negative coefficient decreases Systolic.
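
The way coefficients hit exactly zero can be illustrated with the soft-thresholding operator, which is the lasso solution in the simplified case of uncorrelated, standardized predictors. The sketch below is an assumption-laden illustration: the starting coefficients are hypothetical, and real fits on correlated predictors will not follow this path exactly.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients, for illustration only
unpenalized = {"Smoker": 1.5, "HealthFair": -0.4, "Weight": 0.05}

path = {
    lam: {name: soft_threshold(z, lam) for name, z in unpenalized.items()}
    for lam in (0.0, 0.1, 0.5)
}
for lam, coefs in path.items():
    print(lam, coefs)
```

Small coefficients vanish first as \(\lambda\) grows, while large ones are shrunk but survive, which matches the pattern of curves dropping to zero in the plot.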

Optimize model

We can apply cross validation (10-fold) to randomly partition the data into 10 different training and testing datasets. When we plot regularization’s influence on MSE (mean square error), we should expect a positive relationship. Note: For reproducibility, we’ll set the random seed to 123.

set.seed(123)
cvfit = cv.glmnet(X, Y)
plot(cvfit)
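
The partitioning behind 10-fold cross-validation can be sketched as follows. The helper `kfold_indices` is hypothetical, and the seed here only makes this Python sketch repeatable; it does not reproduce the folds that R draws with set.seed(123).

```python
import random


def kfold_indices(n, k, seed):
    """Shuffle row indices 0..n-1 and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


folds = kfold_indices(n=100, k=10, seed=123)

# Each fold takes one turn as the test set; the other nine form the training set
splits = []
for k, test_rows in enumerate(folds):
    train_rows = [i for j, fold in enumerate(folds) if j != k for i in fold]
    splits.append((train_rows, test_rows))

print([(len(train), len(test)) for train, test in splits])
```

With 100 patients, every model is trained on 90 rows and scored on the 10 it has not seen, and each row is used for testing exactly once.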

As expected, as \(\lambda \uparrow\), MSE\(\uparrow\). The error bars show one standard error above and below the mean cross-validated MSE at each \(\lambda\), and the two vertical dotted lines mark lambda.min and lambda.1se. A \(\lambda\) between those two lines is a reasonable choice.

In seeking to minimize cross-validated error, we would use the following function:

cvfit$lambda.min

where \(\lambda\) = 0.0431221 is the value that minimizes the cross-validated error. Alternatively, we could use the following:

cvfit$lambda.1se

where \(\lambda\) = 0.2771919 is the largest \(\lambda\) at which the MSE is within one standard error of the minimal MSE. This choice trades a little accuracy for a sparser, more heavily regularized model.
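
The two selection rules can be written out in a few lines. In this sketch, the function `select_lambdas` and the cross-validation numbers are hypothetical; only the rules themselves (minimum error, and largest \(\lambda\) within one standard error of that minimum) come from the text.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from per-lambda CV mean error and standard error."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[best] + cv_se[best]
    lambda_1se = max(lam for lam, err in zip(lambdas, cv_mean) if err <= threshold)
    return lambdas[best], lambda_1se


# Hypothetical cross-validation results, for illustration only
lambdas = [0.01, 0.04, 0.10, 0.28, 0.60]
cv_mean = [0.62, 0.58, 0.60, 0.63, 0.80]
cv_se = [0.06, 0.06, 0.06, 0.07, 0.09]

lambda_min, lambda_1se = select_lambdas(lambdas, cv_mean, cv_se)
print(lambda_min, lambda_1se)
```

Here the minimum error occurs at 0.04, but every \(\lambda\) up to 0.28 has an error no worse than that minimum plus one standard error, so the one-standard-error rule picks 0.28.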

We can now print the coefficient \(\theta\) values from our model.

MinCVErrorModel <- coef(cvfit, s = "lambda.min")
print(MinCVErrorModel)
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                                    1
## (Intercept)              -0.39871736
## Age                       0.04952344
## Height                    0.04157106
## Weight                    .         
## Male                      .         
## County General Hospital   0.03032041
## St.Mary's Medical Center  .         
## VA Hospital              -0.09250336
## HealthExcellent           .         
## HealthFair               -0.28799287
## HealthGood                .         
## HealthPoor                .         
## Smoker                    1.36563993

For easier access, let’s produce a table of the coefficient names and their values.

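From the above, we see the factors with the greatest impact on Systolic are Smoker and HealthFair. Because Systolic was standardized along with the other numeric columns, the coefficients are in standard deviations: being a smoker raises \(\hat{Systolic}\) by 1.37 standard deviations, and reporting fair health lowers it by 0.29 standard deviations, holding the other factors fixed.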

Our final minimized cross-validated error model is:

\[\begin{equation} \hat{Systolic} = 0.05(Age) + 0.04(Height) + 0.03(County General Hospital) + ... \\ - 0.09(VA Hospital) - 0.29(HealthFair) + 1.37(Smoker) -0.40 \\ \end{equation}\]
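
To show how the fitted equation is used, the sketch below plugs a hypothetical patient into it. The coefficients are the rounded values from the equation above; the function name and the patient's values are made up for illustration. Since the numeric columns, including Systolic, were standardized before fitting, Age and Height go in as z-scores and the prediction comes out in standard deviations of Systolic.

```python
# Rounded coefficients from the fitted equation above
INTERCEPT = -0.40
COEFFICIENTS = {
    "Age": 0.05,
    "Height": 0.04,
    "County General Hospital": 0.03,
    "VA Hospital": -0.09,
    "HealthFair": -0.29,
    "Smoker": 1.37,
}


def predict_systolic_z(patient):
    """Predicted Systolic in standardized units; features not supplied count as 0."""
    return INTERCEPT + sum(
        coef * patient.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )


# Hypothetical patient: Age and Height are z-scores, the rest are 0/1 indicators
patient = {"Age": 0.5, "Height": -0.2, "County General Hospital": 1, "Smoker": 1}
prediction = predict_systolic_z(patient)
print(round(prediction, 3))
```

Converting the result back to raw blood pressure units would require the mean and standard deviation of Systolic from the training data, which scale() stores as attributes on its output.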

Of note, the factors Weight, Gender, St.Mary’s location, HealthExcellent, HealthGood and HealthPoor were not contributing factors in this final model.

Results

Linear regression was used to construct a model for predicting Systolic blood pressure from 12 numeric and one-hot encoded predictors. A 10-fold cross-validated model was then built using lasso regularization, with \(\lambda\) set to minimize cross-validated error (\(\lambda\) = 0.0431221). The top two predictors in this regularized model, which retained six factors, were Smoker and HealthFair.

Thank you for reading, and happy regressing!

R

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(base64enc, ggplot2, kableExtra)

patients <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/2 linear regression/patients.csv", header = TRUE)

#Preview structure
str(patients)

#Preview top 5 rows
head(patients, n=5)

patientsOriginal <- patients

df <- patients[-c(2, 5)] #deletes columns 2 and 5

df_categorical <- df[,c("Gender","Location","SelfAssessedHealthStatus", "Smoker")]
df_numeric <- df[,c("Age","Height", "Systolic", "Weight")]

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dummies)

#One-hot encode Gender, Location, and SelfAssessedHealthStatus
#sep sets the separator between the column name and the level in each new header
df_categorical <- dummy.data.frame(df_categorical, names=c("Location", "SelfAssessedHealthStatus", "Gender"), sep="-")

df_categorical <- df_categorical[-c(1)] #drops the Female column, leaving Male as the indicator (Female = 0, Male = 1)

scaled_numericdf <- scale(df_numeric)

df <- cbind(scaled_numericdf, df_categorical)

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)

AgePlot <- ggplot(data=df, aes(x=Age)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3)

HeightPlot <- ggplot(data=df, aes(x=Height)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3)

WeightPlot <- ggplot(data=df, aes(x=Weight)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3)

SystolicPlot <- ggplot(data=df, aes(x=Systolic)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="green",
                  alpha = .3)

grid.arrange(AgePlot, HeightPlot, WeightPlot, SystolicPlot, nrow=2)

names(df)[5] <- "Male"
names(df)[6] <- "County General Hospital"
names(df)[7] <- "St.Mary's Medical Center"
names(df)[8] <- "VA Hospital"
names(df)[9] <- "HealthExcellent"
names(df)[10] <- "HealthFair"
names(df)[11] <- "HealthGood"
names(df)[12] <- "HealthPoor"
names(df)[13] <- "Smoker"

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)

AgePlot <- ggplot(data=df, aes(x=Age, y=Systolic)) + 
                    geom_point(size=0.5) +
                    geom_smooth(method=lm)

HeightPlot <- ggplot(data=df, aes(x=Height, y=Systolic)) + 
                    geom_point(size=0.5) +
                    geom_smooth(method=lm)

WeightPlot <- ggplot(data=df, aes(x=Weight, y=Systolic)) + 
                    geom_point(size=0.5) +
                    geom_smooth(method=lm)

grid.arrange(AgePlot, HeightPlot, WeightPlot, nrow=2)

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(glmnet)

#There are several methods available, but I prefer the classic.
#https://stackoverflow.com/questions/16518428/right-way-to-convert-data-frame-to-a-numeric-matrix-when-df-also-contains-strin
#df_matrix <- data.matrix(df) 

df_matrix <- as.matrix(df)

#And split into x and y
X <- df_matrix[,c(1,2,4,5,6,7,8,9,10,11,12,13)]
Y <- df_matrix[,c(3)]

fit <- glmnet(X, Y)

plot(fit, label = TRUE)

set.seed(123)
cvfit = cv.glmnet(X, Y)

plot(cvfit)

cvfit$lambda.min

cvfit$lambda.1se

MinCVErrorModel <- coef(cvfit, s = "lambda.min")
print(MinCVErrorModel)

MATLAB

% Clear previous variables, wipe screen, close windows
clear all
clc
% close all

% https://www.mathworks.com/help/matlab/matlab_prog/create-a-table.html
load patients;

% Target ---------------------------------------------------------
Y = Systolic;

% Standardize numericals -----------------------------------------
XNumeric = [Age Height Weight];
XNumeric_scaled = zscore(XNumeric);

% One-hot encode categoricals ------------------------------------
% Because these attributes are single columns containing many values, we
% need to break them into binary attributes, one for each value.
Gender = nominal(Gender);
GenderCateg = dummyvar(Gender);

Location = nominal(Location);
LocationCateg = dummyvar(Location);

SelfAssessedHealthStatus = nominal(SelfAssessedHealthStatus);
SelfAssessedHealthStatusCateg = dummyvar(SelfAssessedHealthStatus);

% Bring categoricals together
% Now that we've broken each attribute value into a separate binary vector,
% we need to bring them all back together into a single matrix.
XCateg = [GenderCateg LocationCateg SelfAssessedHealthStatusCateg Smoker];

% Merge numerical with categorical matrices ----------------------
XAll = [XNumeric_scaled XCateg];

% Lasso ==========================================================
% We need to determine the number of k-folds and alpha value.
% With those values set, we can run our lasso linear regression.
% [B, FitInfo] = lasso(X, Y, Name, Value)

% Set cross validation k-fold, k
kfold = 10;

% 'Alpha', alpha value, where alpha = 1 is lasso, and = 0.00001 approaches
% ridge
alpha = 1;

% Don't set lambda. It's a vector, not a scalar.
% Default lambda count (steps) = 100
[B, FitInfo] = lasso(XAll, Y, 'CV', kfold, 'Alpha', alpha);

% Lasso Plot of Coefficients =====================================
lassoPlot(B, FitInfo, 'PredictorNames', {'Age', 'Height', 'Weight', ...
    'Female', 'Male', ...
    'County General Hospital', 'St Marys Medical Center', 'VA Hospital', ...
    'Excellent', 'Fair', 'Good', 'Poor', ...
    'Smoker'}, ...
    'PlotType', 'lambda', ...
    'XScale', 'log');
ylabel('theta')
xlabel('lambda')

% Cross-validated Deviance of Lasso Plot =========================
% Produce an extra graph to display cross-validation
lassoPlot(B, FitInfo, 'PlotType', 'CV');

% Theta vs Predictors ============================================
% Produce an extra graph to display theta vs predictors
figure, pcolor(B), xlabel('Theta'), ylabel('Predictors')

% Interpretation -------------------------------------------------
% Count the nonzero coefficients at the largest lambda whose MSE is
% within one standard error of the minimum MSE.
indx = FitInfo.Index1SE;
B0 = B(:,indx);
nonzeros = sum(B0 ~= 0)

Python

# coding: utf-8

# In[68]:

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import linear_model
from sklearn.linear_model import Lasso
from sklearn import preprocessing

# In[69]:

# Load data
# Because there is a header row, set header=0
patients = pd.read_csv("C:/tmp/patients.csv", header=0)

# Back up patients, just in case we need it later
# (.copy() makes an independent copy rather than a second name for the same table)
patientsBackup = patients.copy()

# First split matrix into y (dependent) and x (independent)
# Remember, Python is 0-offset! The "3rd" entry is at position 2.
# patientsY = Systolic
# patientsX = Everything else excluding LastName and Systolic

# In[61]:

# Split dependent variable and independent variables
patientsY = patients["Systolic"]
patientsX = patients[["Age", "Gender", "Height", "Location",
                      "SelfAssessedHealthStatus", "Weight"]]
patientsXNumeric = patients[["Age", "Height", "Weight"]]

# Smoker is not pulled with the other categorical data, so this next code line
# was added
patientsXBinary = patients[["Smoker"]]

# In[62]:

# Standardize
patientsXNumeric = patients[["Age", "Height", "Weight"]]
patientsXNumeric_scaled = preprocessing.scale(patientsXNumeric, axis=0)

# Standardizing returns a plain array without the column headers. I need to fix this.

# In[63]:

# Return only object datatypes (non-numeric here)
categoriesX = patientsX.select_dtypes(include=[object]).copy()
categoriesX.head()

# One-Hot Encoding
#
# As said in this terrific one-hot tutorial:
#
#   There are many libraries out there that support one-hot encoding but the
#   simplest one is using pandas' .get_dummies() method.
#
#   There are mainly three arguments important here, the first one is the
#   DataFrame you want to encode on, second being the columns argument which lets
#   you specify the columns you want to do encoding on, and third, the prefix
#   argument which lets you specify the prefix for the new columns that will be
#   created after encoding.
#
# LastName is not to be included in the linear regression.

# In[34]:

categoriesX_onehot = categoriesX.copy()
categoriesX_onehot = pd.get_dummies(
    categoriesX,
    columns=["Gender", "Location", "SelfAssessedHealthStatus"],
    prefix=["Gender", "Location", "SelfAssessedHealthStatus"])
categoriesXBinary_onehot = pd.get_dummies(
    patientsXBinary, columns=["Smoker"], prefix=["Smoker"])

# Return results
print(categoriesX_onehot.head())
print(categoriesXBinary_onehot.head())

# Now that one-hot encoding has split the categorical attributes into many
# dummy attributes, they must be concatenated back together. This can be done via
# pandas' .concat() method. The axis argument is set to 1 as you want to merge
# on columns.

# In[35]:

print("categoriesX_onehot is:", type(categoriesX_onehot))
print(categoriesX_onehot.shape)
print("categoriesXBinary_onehot is:", type(categoriesXBinary_onehot))
print(categoriesXBinary_onehot.shape)
print("patientsXNumeric_scaled is:", type(patientsXNumeric_scaled))
print(patientsXNumeric_scaled.shape)

# patientsXNumeric_scaled is an array. I used
# https://stackoverflow.com/questions/20763012/creating-a-pandas-dataframe-from-a-numpy-array-how-do-i-specify-the-index-colum
# to convert it to a dataframe.

# In[36]:

patientsXNumeric_scaleddf = pd.DataFrame(patientsXNumeric_scaled)

# In[37]:

print("patientsXNumeric_scaleddf is:", type(patientsXNumeric_scaleddf))

# In[38]:

print(patientsXNumeric_scaleddf.head())

# Looks better, but it still needs column names.

# In[39]:

patientsXNumeric_scaleddf.columns = ["Age", "Height", "Weight"]

# Now we bring all the columns back together as one dataframe.

# In[40]:

# Bring them back together
# Unscaled alternative, overridden by the scaled version on the next line:
# patientsXAll = pd.concat([categoriesX_onehot, patientsXNumeric, categoriesXBinary_onehot], axis=1)
patientsXAll = pd.concat(
    [categoriesX_onehot, patientsXNumeric_scaleddf, categoriesXBinary_onehot],
    axis=1)

# Top 3 rows
print(patientsXAll.head(3))

# Fit a lasso model and report its 10-fold cross-validation scores
lasso = linear_model.Lasso()
print("Cross Val Score:" + str(cross_val_score(lasso, patientsXAll, patientsY, cv=10)))

Markdown

To view this entire document’s markdown code, click here.

Data

If you don’t have the dataset, copy the table below, paste it into Excel, and save it as a comma-separated file named patients.csv in your preferred directory.

Age Diastolic Gender Height LastName Location SelfAssessedHealthStatus Smoker Systolic Weight
38 93 ‘Male’ 71 ‘Smith’ ‘County General Hospital’ ‘Excellent’ 1 124 176
43 77 ‘Male’ 69 ‘Johnson’ ‘VA Hospital’ ‘Fair’ 0 109 163
38 83 ‘Female’ 64 ‘Williams’ ‘St. Mary’s Medical Center’ ‘Good’ 0 125 131
40 75 ‘Female’ 67 ‘Jones’ ‘VA Hospital’ ‘Fair’ 0 117 133
49 80 ‘Female’ 64 ‘Brown’ ‘County General Hospital’ ‘Good’ 0 122 119
46 70 ‘Female’ 68 ‘Davis’ ‘St. Mary’s Medical Center’ ‘Good’ 0 121 142
33 88 ‘Female’ 64 ‘Miller’ ‘VA Hospital’ ‘Good’ 1 130 142
40 82 ‘Male’ 68 ‘Wilson’ ‘VA Hospital’ ‘Good’ 0 115 180
28 78 ‘Male’ 68 ‘Moore’ ‘St. Mary’s Medical Center’ ‘Excellent’ 0 115 183
31 86 ‘Female’ 66 ‘Taylor’ ‘County General Hospital’ ‘Excellent’ 0 118 132
45 77 ‘Female’ 68 ‘Anderson’ ‘County General Hospital’ ‘Excellent’ 0 114 128
42 68 ‘Female’ 66 ‘Thomas’ ‘St. Mary’s Medical Center’ ‘Poor’ 0 115 137
25 74 ‘Male’ 71 ‘Jackson’ ‘VA Hospital’ ‘Poor’ 0 127 174
39 95 ‘Male’ 72 ‘White’ ‘VA Hospital’ ‘Excellent’ 1 130 202
36 79 ‘Female’ 65 ‘Harris’ ‘St. Mary’s Medical Center’ ‘Good’ 0 114 129
48 92 ‘Male’ 71 ‘Martin’ ‘VA Hospital’ ‘Good’ 1 130 181
32 95 ‘Male’ 69 ‘Thompson’ ‘St. Mary’s Medical Center’ ‘Excellent’ 1 124 191
27 79 ‘Female’ 69 ‘Garcia’ ‘VA Hospital’ ‘Fair’ 1 123 131
37 77 ‘Male’ 70 ‘Martinez’ ‘County General Hospital’ ‘Good’ 0 119 179
50 76 ‘Male’ 68 ‘Robinson’ ‘County General Hospital’ ‘Good’ 0 125 172
48 75 ‘Female’ 65 ‘Clark’ ‘VA Hospital’ ‘Excellent’ 0 121 133
39 79 ‘Female’ 64 ‘Rodriguez’ ‘VA Hospital’ ‘Fair’ 0 123 117
41 88 ‘Female’ 62 ‘Lewis’ ‘VA Hospital’ ‘Fair’ 0 114 137
44 90 ‘Female’ 66 ‘Lee’ ‘County General Hospital’ ‘Fair’ 1 128 146
28 96 ‘Female’ 65 ‘Walker’ ‘County General Hospital’ ‘Good’ 1 129 123
25 77 ‘Male’ 70 ‘Hall’ ‘VA Hospital’ ‘Poor’ 0 114 189
39 80 ‘Female’ 63 ‘Allen’ ‘VA Hospital’ ‘Excellent’ 0 113 143
25 76 ‘Female’ 63 ‘Young’ ‘County General Hospital’ ‘Good’ 0 125 114
36 83 ‘Male’ 68 ‘Hernandez’ ‘County General Hospital’ ‘Poor’ 0 120 166
30 89 ‘Male’ 67 ‘King’ ‘County General Hospital’ ‘Excellent’ 1 127 186
45 92 ‘Female’ 70 ‘Wright’ ‘VA Hospital’ ‘Excellent’ 1 134 126
40 83 ‘Female’ 66 ‘Lopez’ ‘VA Hospital’ ‘Poor’ 0 121 137
25 80 ‘Female’ 64 ‘Hill’ ‘St. Mary’s Medical Center’ ‘Excellent’ 0 115 138
47 84 ‘Male’ 70 ‘Scott’ ‘St. Mary’s Medical Center’ ‘Excellent’ 0 127 187
44 92 ‘Male’ 71 ‘Green’ ‘County General Hospital’ ‘Good’ 0 121 193
48 83 ‘Female’ 66 ‘Adams’ ‘VA Hospital’ ‘Excellent’ 0 127 137
44 90 ‘Male’ 71 ‘Baker’ ‘VA Hospital’ ‘Good’ 1 136 192
35 85 ‘Female’ 66 ‘Gonzalez’ ‘St. Mary’s Medical Center’ ‘Fair’ 0 117 118
33 90 ‘Male’ 66 ‘Nelson’ ‘St. Mary’s Medical Center’ ‘Good’ 1 124 180
38 74 ‘Female’ 63 ‘Carter’ ‘St. Mary’s Medical Center’ ‘Good’ 0 120 128
39 92 ‘Male’ 71 ‘Mitchell’ ‘County General Hospital’ ‘Fair’ 1 128 164
44 80 ‘Male’ 69 ‘Perez’ ‘VA Hospital’ ‘Excellent’ 0 116 183
44 89 ‘Male’ 70 ‘Roberts’ ‘VA Hospital’ ‘Good’ 1 132 169
37 96 ‘Male’ 70 ‘Turner’ ‘VA Hospital’ ‘Excellent’ 1 137 194
45 89 ‘Male’ 67 ‘Phillips’ ‘VA Hospital’ ‘Good’ 0 117 172
37 77 ‘Female’ 65 ‘Campbell’ ‘County General Hospital’ ‘Fair’ 0 116 135
30 81 ‘Male’ 68 ‘Parker’ ‘VA Hospital’ ‘Poor’ 0 119 182
39 76 ‘Female’ 62 ‘Evans’ ‘County General Hospital’ ‘Good’ 0 123 121
42 83 ‘Male’ 70 ‘Edwards’ ‘County General Hospital’ ‘Excellent’ 0 116 158
42 78 ‘Male’ 67 ‘Collins’ ‘County General Hospital’ ‘Good’ 1 124 179
49 95 ‘Male’ 68 ‘Stewart’ ‘County General Hospital’ ‘Poor’ 1 129 170
44 91 ‘Female’ 62 ‘Sanchez’ ‘St. Mary’s Medical Center’ ‘Good’ 1 130 136
43 91 ‘Female’ 64 ‘Morris’ ‘County General Hospital’ ‘Poor’ 1 132 135
47 86 ‘Female’ 66 ‘Rogers’ ‘VA Hospital’ ‘Excellent’ 0 117 147
50 89 ‘Male’ 72 ‘Reed’ ‘VA Hospital’ ‘Excellent’ 1 129 186
38 79 ‘Female’ 63 ‘Cook’ ‘VA Hospital’ ‘Excellent’ 0 118 124
41 74 ‘Female’ 66 ‘Morgan’ ‘St. Mary’s Medical Center’ ‘Good’ 0 120 134
45 82 ‘Male’ 70 ‘Bell’ ‘St. Mary’s Medical Center’ ‘Good’ 1 138 170
36 76 ‘Male’ 71 ‘Murphy’ ‘VA Hospital’ ‘Good’ 0 117 180
38 81 ‘Female’ 68 ‘Bailey’ ‘St. Mary’s Medical Center’ ‘Good’ 0 113 130
29 77 ‘Female’ 63 ‘Rivera’ ‘County General Hospital’ ‘Excellent’ 0 122 130
28 73 ‘Female’ 65 ‘Cooper’ ‘VA Hospital’ ‘Good’ 0 115 127
30 85 ‘Female’ 67 ‘Richardson’ ‘County General Hospital’ ‘Excellent’ 0 120 141
28 76 ‘Female’ 66 ‘Cox’ ‘County General Hospital’ ‘Good’ 0 117 111
29 80 ‘Female’ 68 ‘Howard’ ‘VA Hospital’ ‘Excellent’ 0 123 134
36 80 ‘Male’ 71 ‘Ward’ ‘St. Mary’s Medical Center’ ‘Good’ 0 123 189
45 79 ‘Female’ 70 ‘Torres’ ‘County General Hospital’ ‘Excellent’ 0 119 137
32 82 ‘Female’ 60 ‘Peterson’ ‘County General Hospital’ ‘Excellent’ 0 110 136
31 79 ‘Female’ 64 ‘Gray’ ‘VA Hospital’ ‘Excellent’ 0 121 130
48 82 ‘Female’ 64 ‘Ramirez’ ‘County General Hospital’ ‘Excellent’ 1 138 137
25 75 ‘Male’ 66 ‘James’ ‘County General Hospital’ ‘Good’ 0 125 186
40 91 ‘Female’ 64 ‘Watson’ ‘VA Hospital’ ‘Fair’ 1 122 127
39 74 ‘Male’ 72 ‘Brooks’ ‘St. Mary’s Medical Center’ ‘Excellent’ 0 120 176
41 78 ‘Female’ 65 ‘Kelly’ ‘St. Mary’s Medical Center’ ‘Poor’ 0 117 127
33 85 ‘Female’ 67 ‘Sanders’ ‘St. Mary’s Medical Center’ ‘Excellent’ 1 125 115
31 84 ‘Male’ 72 ‘Price’ ‘VA Hospital’ ‘Fair’ 1 124 178
35 75 ‘Female’ 64 ‘Bennett’ ‘County General Hospital’ ‘Fair’ 0 121 131
32 78 ‘Male’ 68 ‘Wood’ ‘St. Mary’s Medical Center’ ‘Poor’ 0 118 183
42 81 ‘Male’ 66 ‘Barnes’ ‘County General Hospital’ ‘Excellent’ 0 120 194
48 79 ‘Female’ 64 ‘Ross’ ‘VA Hospital’ ‘Good’ 0 118 126
34 85 ‘Male’ 68 ‘Henderson’ ‘St. Mary’s Medical Center’ ‘Good’ 0 118 186
39 79 ‘Male’ 69 ‘Coleman’ ‘VA Hospital’ ‘Excellent’ 0 122 188
28 82 ‘Male’ 69 ‘Jenkins’ ‘County General Hospital’ ‘Good’ 1 134 189
29 80 ‘Female’ 64 ‘Perry’ ‘St. Mary’s Medical Center’ ‘Good’ 0 131 120
32 80 ‘Female’ 63 ‘Powell’ ‘VA Hospital’ ‘Excellent’ 0 113 132
39 92 ‘Male’ 68 ‘Long’ ‘County General Hospital’ ‘Good’ 1 125 182
37 92 ‘Female’ 65 ‘Patterson’ ‘County General Hospital’ ‘Poor’ 1 135 120
49 96 ‘Female’ 63 ‘Hughes’ ‘County General Hospital’ ‘Good’ 1 128 123
31 87 ‘Female’ 66 ‘Flores’ ‘VA Hospital’ ‘Good’ 1 123 141
37 81 ‘Female’ 65 ‘Washington’ ‘St. Mary’s Medical Center’ ‘Good’ 0 122 129
38 90 ‘Male’ 68 ‘Butler’ ‘County General Hospital’ ‘Excellent’ 1 138 184
45 77 ‘Male’ 71 ‘Simmons’ ‘VA Hospital’ ‘Excellent’ 0 124 181
30 91 ‘Female’ 70 ‘Foster’ ‘St. Mary’s Medical Center’ ‘Fair’ 0 130 124
48 79 ‘Male’ 71 ‘Gonzales’ ‘County General Hospital’ ‘Good’ 0 123 174
48 73 ‘Female’ 66 ‘Bryant’ ‘County General Hospital’ ‘Excellent’ 0 129 134
25 99 ‘Male’ 69 ‘Alexander’ ‘County General Hospital’ ‘Good’ 1 128 171
44 92 ‘Male’ 69 ‘Russell’ ‘VA Hospital’ ‘Good’ 1 124 188
49 74 ‘Male’ 70 ‘Griffin’ ‘County General Hospital’ ‘Fair’ 0 119 186
45 93 ‘Male’ 68 ‘Diaz’ ‘County General Hospital’ ‘Good’ 1 136 172
48 86 ‘Male’ 66 ‘Hayes’ ‘County General Hospital’ ‘Fair’ 0 114 177
