This analysis builds a cross-validated lasso regression model to predict Systolic blood pressure, using various numeric and categorical factors: Age, Height, Weight, Gender, Hospital Location (CGH - County General Hospital, SMMC - St. Mary’s Medical Center, VA - VA Hospital), Status of Health (Excellent, Fair, Good, Poor), and Smoker.

\[\begin{equation} \hat{Systolic} = \theta(Age) + \theta(Height) + \theta(County General Hospital) + ... \\ + \theta(VA Hospital) + \theta(HealthFair) + \theta(Smoker) + b \\ \end{equation}\]

Lasso regression is a regularization method that lets you control a model’s complexity by penalizing the size of its coefficients. A more heavily regularized model tends to cope better with variation in out-of-sample data, at the cost of a slightly worse fit to the training data. A tutorial video on lasso regression can be found here. Alternatively, the creator’s own writings about lasso regression can be read in this book.

This is just one of the five machine learning modeling guides you can find here.

Overview

Load the patient dataset.
Use linear regression with the following variables:

Age
Gender
Height
Weight
Smoker
Location
SelfAssessedHealthStatus
Systolic blood pressure (Target)

Such that:

\[\begin{equation} \color{red}{\hat{Systolic}} = \theta_1(Age) + \theta_2(Gender) + \theta_3(Height) + \theta_4(Weight) ... \\ ...+ \theta_5(Smoker) + \theta_6(Location) + \theta_7(SelfAssessedHealthStatus) + b \\ \end{equation}\]

Use lasso regression with 10-fold cross-validation to identify useful predictors.

In this project, we’ll be using lasso regularization, adjusting the complexity of the model using the following:

\[ J(\theta) \approx MSE + \lambda C \] Where \(\lambda\) is the amount of regularization and C is model complexity.

Identify the top two predictors.
What \(\lambda\) will you use in your preferred model?
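
The approximate cost function in the overview can be written out more explicitly. The form below is a sketch that assumes the default Gaussian parameterization used by glmnet; the symbols \(N\) (number of patients) and \(x_{ij}\) (value of predictor \(j\) for patient \(i\)) are notation introduced here, not taken from the original text.

\[ J(\theta, b) = \frac{1}{2N} \sum_{i=1}^{N} \left( Systolic_i - b - \sum_{j} \theta_j x_{ij} \right)^2 + \lambda \sum_{j} |\theta_j| \]

The first term plays the role of MSE (up to the factor of one half), and the second term is the complexity measure C: the sum of absolute coefficient values. Because the penalty grows with every nonzero \(\theta_j\), larger values of \(\lambda\) push more coefficients to exactly zero.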

Pre-Modeling

Load Required Packages

There are two ways to load the required packages.

Install pacman using the following code.

# #install.packages("pacman")
# library("pacman")

Or use this function and see if it works for you. If not, again, try the code above.

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")

## Loading required package: pacman

pacman::p_load(base64enc, ggplot2, kableExtra)

Load Data

The dataset we will be loading appears as:

Document Preview

patients <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/2 linear regression/patients.csv", header = TRUE)

Preview Data

Examine the data structure.

#Preview structure
str(patients)

## 'data.frame':    100 obs. of  10 variables:
##  $ Age                     : int  38 43 38 40 49 46 33 40 28 31 ...
##  $ Diastolic               : int  93 77 83 75 80 70 88 82 78 86 ...
##  $ Gender                  : Factor w/ 2 levels "'Female'","'Male'": 2 2 1 1 1 1 1 2 2 1 ...
##  $ Height                  : int  71 69 64 67 64 68 64 68 68 66 ...
##  $ LastName                : Factor w/ 100 levels "'Adams'","'Alexander'",..: 84 45 96 46 11 22 55 97 57 86 ...
##  $ Location                : Factor w/ 3 levels "'County General Hospital'",..: 1 3 2 3 1 2 3 3 2 1 ...
##  $ SelfAssessedHealthStatus: Factor w/ 4 levels "'Excellent'",..: 1 2 3 2 3 3 3 3 1 1 ...
##  $ Smoker                  : int  1 0 0 0 0 0 1 0 0 0 ...
##  $ Systolic                : int  124 109 125 117 122 121 130 115 115 118 ...
##  $ Weight                  : int  176 163 131 133 119 142 142 180 183 132 ...

Examine the top 5 rows.

#Preview top 5 rows
head(patients, n=5)

##   Age Diastolic   Gender Height   LastName                    Location
## 1  38        93   'Male'     71    'Smith'   'County General Hospital'
## 2  43        77   'Male'     69  'Johnson'               'VA Hospital'
## 3  38        83 'Female'     64 'Williams' 'St. Mary's Medical Center'
## 4  40        75 'Female'     67    'Jones'               'VA Hospital'
## 5  49        80 'Female'     64    'Brown'   'County General Hospital'
##   SelfAssessedHealthStatus Smoker Systolic Weight
## 1              'Excellent'      1      124    176
## 2                   'Fair'      0      109    163
## 3                   'Good'      0      125    131
## 4                   'Fair'      0      117    133
## 5                   'Good'      0      122    119

Preprocessing

Each attribute contains 100 observations; there are no missing values. Therefore, we do not need to fill missing values with mean/mode, or drop any columns/rows.

Now remove the unwanted columns, Diastolic and LastName, from the patients table.

patientsOriginal <- patients

df <- patients[-c(2, 5)] #deletes columns 2 and 5

Split dataframes into categorical and numeric

df_categorical <- df[,c("Gender","Location","SelfAssessedHealthStatus", "Smoker")]
df_numeric <- df[,c("Age","Height", "Systolic", "Weight")]

One-hot encode categorical columns.

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dummies)

#One-hot encode Location, SelfAssessedHealthStatus and Gender
#sep sets the separator between the column name and the level in each new header
df_categorical <- dummy.data.frame(df_categorical, names=c("Location", "SelfAssessedHealthStatus", "Gender"), sep="-")

df_categorical <- df_categorical[-c(1)] #drops column 1 (the Female dummy), leaving Male = 1 and Female = 0
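
To make the encoding step concrete, here is a minimal sketch of one-hot encoding in plain Python. It is not the dummies package's implementation; the function name one_hot and its drop_first argument are made up for illustration. The Gender values are the first five rows shown by head(patients).

```python
def one_hot(values, drop_first=False):
    """One-hot encode a list of category labels.

    Returns (column_names, rows), where each row is a list of 0/1 indicators.
    """
    levels = sorted(set(values))      # distinct labels, sorted alphabetically
    if drop_first:
        levels = levels[1:]           # the first level becomes the reference
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


gender = ["Male", "Male", "Female", "Female", "Female"]
columns, encoded = one_hot(gender, drop_first=True)
print(columns)   # ['Male']
print(encoded)   # [[1], [1], [0], [0], [0]]
```

Dropping the Female column loses no information, because Male = 0 already implies Female. The R code above applies this only to Gender and keeps every Location and SelfAssessedHealthStatus column.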

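Standardize the numeric columns, and then recombine the dataframes. Note that df_numeric includes the target, Systolic, so the target is standardized along with the predictors.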

An additional interesting discussion on when to standardize is here

scaled_numericdf <- scale(df_numeric)
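
As a rough illustration of what scale() does to each column, the sketch below centers a list of values on its mean and divides by the sample standard deviation (the n - 1 version, which is what R's sd() computes). The function name standardize is hypothetical, and the five Age values are only the rows shown by head(patients), so the result will differ from scaling all 100 rows.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)   # sample standard deviation (divides by n - 1)
    return [(v - m) / s for v in values]


ages = [38, 43, 38, 40, 49]   # first five Age values from head(patients)
z = standardize(ages)
print([round(v, 3) for v in z])
```

After this step every numeric column has mean 0 and standard deviation 1, so the lasso penalty treats Age, Height and Weight on a comparable footing.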

Recombine the forked categorical and numeric dataframes together using column bind.

df <- cbind(scaled_numericdf, df_categorical)

Data Exploration

Histograms

Plot histogram of numeric columns. For bin specification, see here.

For plotting multiples, see here, and consider cowplot.

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)

AgePlot <- ggplot(data=df, aes(x=Age)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3)

HeightPlot <- ggplot(data=df, aes(x=Height)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3)

WeightPlot <- ggplot(data=df, aes(x=Weight)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3)

SystolicPlot <- ggplot(data=df, aes(x=Systolic)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="green",
                  alpha = .3)

grid.arrange(AgePlot, HeightPlot, WeightPlot, SystolicPlot, nrow=2)

Rename column headers for easier interpretation and reference.

names(df)[5] <- "Male"
names(df)[6] <- "County General Hospital"
names(df)[7] <- "St.Mary's Medical Center"
names(df)[8] <- "VA Hospital"
names(df)[9] <- "HealthExcellent"
names(df)[10] <- "HealthFair"
names(df)[11] <- "HealthGood"
names(df)[12] <- "HealthPoor"
names(df)[13] <- "Smoker"

Scatterplots

Scatterplot visualization is an important step of the statistical analysis process, as descriptive statistics can oversimplify your understanding. For further understanding of this, please refer to Anscombe’s quartet.

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)

AgePlot <- ggplot(data=df, aes(x=Age, y=Systolic)) + 
                    geom_point(size=0.5) +
                    geom_smooth(method=lm)

HeightPlot <- ggplot(data=df, aes(x=Height, y=Systolic)) + 
                    geom_point(size=0.5) +
                    geom_smooth(method=lm)

WeightPlot <- ggplot(data=df, aes(x=Weight, y=Systolic)) + 
                    geom_point(size=0.5) +
                    geom_smooth(method=lm)

grid.arrange(AgePlot, HeightPlot, WeightPlot, nrow=2)

Modeling

Having completed the pre-processing and data exploration phases, we now move on to building a lasso regression model. Stack Overflow users recommend glmnet over lars as the preferred and more actively maintained lasso regression package.

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(glmnet)

Function glmnet runs on a matrix, not a dataframe. Therefore, the df must be first converted.

#There are several methods available, but I prefer the classic.
#https://stackoverflow.com/questions/16518428/right-way-to-convert-data-frame-to-a-numeric-matrix-when-df-also-contains-strin
#df_matrix <- data.matrix(df) 

df_matrix <- as.matrix(df)

#And split into x and y
X <- df_matrix[,c(1,2,4,5,6,7,8,9,10,11,12,13)]
Y <- df_matrix[,c(3)]

Now it’s time to fit the regression model.

fit <- glmnet(X, Y)

We can visualize the coefficients with the following plot.

plot(fit, label = TRUE)

Each curve is a factor, such as Weight or Height. Interpreting this plot, we can see that as regularization increases (\(\lambda \uparrow\), leftward on the x-axis), coefficients shrink to zero and drop out of our model. The coefficient value (y-axis) represents that factor’s influence on Systolic at that level of regularization. A positive coefficient increases the predicted Systolic, while a negative coefficient decreases it.
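
Why do coefficients hit exactly zero rather than just getting small? A minimal sketch of the underlying operation, soft-thresholding, is shown below. It gives the exact lasso solution only in the idealized case of standardized, uncorrelated predictors; glmnet applies an update of this kind repeatedly inside coordinate descent. The coefficient names and values are hypothetical and are not taken from the patients data.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients for three predictors
unpenalized = {"A": 1.5, "B": -0.4, "C": 0.1}

for lam in (0.0, 0.2, 0.5, 2.0):
    shrunk = {name: round(soft_threshold(c, lam), 2)
              for name, c in unpenalized.items()}
    print(lam, shrunk)
```

As lam grows, the smallest coefficients are zeroed first and the largest survive longest, which mirrors the order in which curves drop out of the coefficient plot.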

Optimize model

We can apply cross validation (10-fold) to randomly partition the data into 10 different training and testing datasets. When we plot regularization’s influence on MSE (mean square error), we should expect a positive relationship. Note: For reproducibility, we’ll set the random seed to 123.

set.seed(123)
cvfit = cv.glmnet(X, Y)

plot(cvfit)

As expected, as \(\lambda \uparrow\), MSE\(\uparrow\). The error bars show one standard error above and below the cross-validated MSE at each \(\lambda\), and the two vertical dotted lines mark lambda.min and lambda.1se. It is recommended that your selected \(\lambda\) fall within the range between these two lines.
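
The 10-fold partitioning described above can be sketched in a few lines of plain Python. This is an illustration only, not how cv.glmnet assigns folds internally, and a Python seed of 123 will not reproduce R's partition. The function name k_fold_indices is hypothetical.

```python
import random


def k_fold_indices(n, k, seed):
    """Shuffle row indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


folds = k_fold_indices(n=100, k=10, seed=123)

for i, test_idx in enumerate(folds):
    # Every fold takes one turn as the test set; the other nine train the model
    train_idx = [j for fold in folds if fold is not test_idx for j in fold]
    print(i, len(train_idx), len(test_idx))
```

With 100 patients and 10 folds, each model is fit on 90 rows and scored on the remaining 10. The cross-validated MSE at a given \(\lambda\) is the average of those 10 test scores.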

In seeking to minimize cross-validated error, we would use the following function:

cvfit$lambda.min

where \(\lambda\) = 0.0431221 is the value that minimizes the cross-validated error. Alternatively, we could use the following:

cvfit$lambda.1se

where \(\lambda\) = 0.2771919 is the largest \(\lambda\) at which the MSE is within one standard error of the minimal MSE.
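
The two selection rules can be expressed as a short sketch. The function name select_lambdas is hypothetical, and the cross-validation curve below is made-up illustrative data, not the output of cv.glmnet on the patients dataset.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from a cross-validation curve."""
    # lambda_min: the lambda with the lowest mean cross-validated error
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # lambda_1se: the largest lambda whose error is within one
    # standard error of that minimum
    threshold = cv_mean[i_min] + cv_se[i_min]
    lambda_1se = max(lam for lam, m in zip(lambdas, cv_mean) if m <= threshold)
    return lambdas[i_min], lambda_1se


# Made-up cross-validation curve for illustration
lambdas = [1.0, 0.5, 0.25, 0.1, 0.05, 0.01]
cv_mean = [1.00, 0.70, 0.52, 0.47, 0.45, 0.46]
cv_se = [0.10, 0.08, 0.07, 0.06, 0.06, 0.06]

lam_min, lam_1se = select_lambdas(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)   # 0.05 0.1
```

Choosing lambda.1se trades a small, statistically indistinguishable increase in error for a sparser model with fewer predictors.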

We can now print the coefficient \(\theta\) values from our model.

MinCVErrorModel <- coef(cvfit, s = "lambda.min")
print(MinCVErrorModel)

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                                    1
## (Intercept)              -0.39871736
## Age                       0.04952344
## Height                    0.04157106
## Weight                    .         
## Male                      .         
## County General Hospital   0.03032041
## St.Mary's Medical Center  .         
## VA Hospital              -0.09250336
## HealthExcellent           .         
## HealthFair               -0.28799287
## HealthGood                .         
## HealthPoor                .         
## Smoker                    1.36563993

For easier access, let’s produce a table of the coefficient names and their values.

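From the above, we see the factors with the greatest impact on Systolic are Smoker and HealthFair. Because Systolic was standardized together with the other numeric columns, these coefficients are expressed in standard deviations of Systolic rather than in raw blood pressure units. If you are a smoker, \(\hat{Systolic}\) increases by 1.37 standard deviations, and if your health is fair, \(\hat{Systolic}\) decreases by 0.29 standard deviations.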
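
If you want effects in the original measurement units, a standardized coefficient can be rescaled. The sketch below assumes the target was z-scored (as scale() does) and uses placeholder standard deviations of 10.0 and 5.0, which are hypothetical values and not computed from the dataset; substitute sd(patients$Systolic) and the relevant predictor's standard deviation. The function name is also hypothetical.

```python
def effect_in_original_units(theta_std, sd_y, sd_x=1.0):
    """Convert a coefficient fit on standardized data back to raw units.

    theta_std: coefficient from the standardized model
    sd_y:      standard deviation of the target before scaling
    sd_x:      standard deviation of the predictor before scaling
               (leave at 1.0 for unscaled 0/1 indicators such as Smoker)
    """
    return theta_std * sd_y / sd_x


SD_SYSTOLIC = 10.0   # hypothetical placeholder, not the real value
SD_AGE = 5.0         # hypothetical placeholder, not the real value

# Smoker is a 0/1 indicator that was never scaled, so only sd_y applies
print(effect_in_original_units(1.37, SD_SYSTOLIC))

# Age was scaled, so both standard deviations apply
print(effect_in_original_units(0.05, SD_SYSTOLIC, SD_AGE))
```

The ranking of predictors does not change under this rescaling for the 0/1 indicators, since they all share the same target standard deviation.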

Our final minimized cross-validated error model is:

\[\begin{equation} \hat{Systolic} = 0.05(Age) + 0.04(Height) + 0.03(County General Hospital) + ... \\ - 0.09(VA Hospital) - 0.29(HealthFair) + 1.37(Smoker) -0.40 \\ \end{equation}\]

Of note, the factors Weight, Gender, St.Mary’s location, HealthExcellent, HealthGood and HealthPoor were not contributing factors in this final model.

Results

Linear regression was used to construct a model for predicting systolic blood pressure from 12 numeric and one-hot encoded predictors. A 10-fold cross-validated model was then built using lasso regularization, with \(\lambda\) set to minimize cross-validated error (\(\lambda\) = 0.0431221). The top two predictors in this regularized model, which retained six factors, were Smoker status and HealthFair.

Thank you for reading, and happy regressing!

R

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(base64enc, ggplot2, kableExtra)

patients <- read.csv("C:/Users/G/Google Drive/aStThomas/6MachineLearning/Assignments/R/2 linear regression/patients.csv", header = TRUE)

#Preview structure
str(patients)

#Preview top 5 rows
head(patients, n=5)

patientsOriginal <- patients

df <- patients[-c(2, 5)] #deletes columns 2 and 5

df_categorical <- df[,c("Gender","Location","SelfAssessedHealthStatus", "Smoker")]
df_numeric <- df[,c("Age","Height", "Systolic", "Weight")]

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(dummies)

#One-hot encode Location, SelfAssessedHealthStatus and Gender
#sep sets the separator between the column name and the level in each new header
df_categorical <- dummy.data.frame(df_categorical, names=c("Location", "SelfAssessedHealthStatus", "Gender"), sep="-")

df_categorical <- df_categorical[-c(1)] #drops column 1 (the Female dummy), leaving Male = 1 and Female = 0

scaled_numericdf <- scale(df_numeric)

df <- cbind(scaled_numericdf, df_categorical)

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)

AgePlot <- ggplot(data=df, aes(x=Age)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3)

HeightPlot <- ggplot(data=df, aes(x=Height)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3)

WeightPlot <- ggplot(data=df, aes(x=Weight)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="blue",
                  alpha = .3)

SystolicPlot <- ggplot(data=df, aes(x=Systolic)) +
                  geom_histogram(bins = 20,
                  col="black",
                  fill="green",
                  alpha = .3)

grid.arrange(AgePlot, HeightPlot, WeightPlot, SystolicPlot, nrow=2)

names(df)[5] <- "Male"
names(df)[6] <- "County General Hospital"
names(df)[7] <- "St.Mary's Medical Center"
names(df)[8] <- "VA Hospital"
names(df)[9] <- "HealthExcellent"
names(df)[10] <- "HealthFair"
names(df)[11] <- "HealthGood"
names(df)[12] <- "HealthPoor"
names(df)[13] <- "Smoker"

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, gridExtra)

AgePlot <- ggplot(data=df, aes(x=Age, y=Systolic)) + 
                    geom_point(size=0.5) +
                    geom_smooth(method=lm)

HeightPlot <- ggplot(data=df, aes(x=Height, y=Systolic)) + 
                    geom_point(size=0.5) +
                    geom_smooth(method=lm)

WeightPlot <- ggplot(data=df, aes(x=Weight, y=Systolic)) + 
                    geom_point(size=0.5) +
                    geom_smooth(method=lm)

grid.arrange(AgePlot, HeightPlot, WeightPlot, nrow=2)

#Check for and install required packages, using pacman
if (!require("pacman")) install.packages("pacman")
pacman::p_load(glmnet)

#There are several methods available, but I prefer the classic.
#https://stackoverflow.com/questions/16518428/right-way-to-convert-data-frame-to-a-numeric-matrix-when-df-also-contains-strin
#df_matrix <- data.matrix(df) 

df_matrix <- as.matrix(df)

#And split into x and y
X <- df_matrix[,c(1,2,4,5,6,7,8,9,10,11,12,13)]
Y <- df_matrix[,c(3)]

fit <- glmnet(X, Y)

plot(fit, label = TRUE)

set.seed(123)
cvfit = cv.glmnet(X, Y)

plot(cvfit)

cvfit$lambda.min

cvfit$lambda.1se

MinCVErrorModel <- coef(cvfit, s = "lambda.min")
print(MinCVErrorModel)

MATLAB

% Clear previous variables, wipe screen, close windows
clear all
clc
% close all

% https://www.mathworks.com/help/matlab/matlab_prog/create-a-table.html
load patients;

% Target ----------------------------------------------------------
Y = Systolic;

% Standardize numericals ------------------------------------------
XNumeric = [Age Height Weight];
XNumeric_scaled = zscore(XNumeric);

% One-hot encode categoricals -------------------------------------
% Because these attributes are single columns containing many values, we
% need to break them into binary attributes, one for each value.
Gender = nominal(Gender);
GenderCateg = dummyvar(Gender);

Location = nominal(Location);
LocationCateg = dummyvar(Location);

SelfAssessedHealthStatus = nominal(SelfAssessedHealthStatus);
SelfAssessedHealthStatusCateg = dummyvar(SelfAssessedHealthStatus);

% Bring categoricals together
% Now that we've broken each attribute value into a separate binary vector,
% we need to bring them all back together into a single matrix.
XCateg = [GenderCateg LocationCateg SelfAssessedHealthStatusCateg Smoker];

% Merge numerical with categorical matrices -----------------------
XAll = [XNumeric_scaled XCateg];

% Lasso ===========================================================
% We need to set the number of cross-validation folds and the alpha value.
% With those values set, we can run our lasso linear regression.
% [B, FitInfo] = lasso(X, Y, Name, Value)

% Set cross-validation k-fold, k
kfold = 10;

% 'Alpha', alpha value, where alpha = 1 is lasso, and alpha = 0.00001
% approaches ridge
alpha = 1;

% Don't set lambda. It's a vector, not a scalar.
% Default lambda count (steps) = 100
[B, FitInfo] = lasso(XAll, Y, 'CV', kfold, 'Alpha', alpha);

% Lasso Plot of Coefficients ======================================
lassoPlot(B, FitInfo, 'PredictorNames', {'Age', 'Height', 'Weight', ...
    'Female', 'Male', ...
    'County General Hospital', 'St Marys Medical Center', 'VA Hospital', ...
    'Excellent', 'Fair', 'Good', 'Poor', ...
    'Smoker'}, ...
    'PlotType', 'lambda', ...
    'XScale', 'log');
ylabel('theta');
xlabel('lambda');

% Cross-validated Deviance of Lasso Plot ==========================
% Produce an extra graph to display cross-validation
lassoPlot(B, FitInfo, 'PlotType', 'CV');

% Theta vs Predictors =============================================
% Produce an extra graph to display theta vs predictors
figure, pcolor(B), xlabel('Theta'), ylabel('Predictors')

% Interpretation --------------------------------------------------
% Count the nonzero coefficients at the largest lambda whose
% cross-validated error is within one standard error of the minimum.
indx = FitInfo.Index1SE;
B0 = B(:,indx);
numNonzero = sum(B0 ~= 0)

Python

# coding: utf-8

# In[68]:

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import linear_model
from sklearn.linear_model import Lasso
from sklearn import preprocessing

# In[69]:

# Load data
# Because there is a header row, set header=0
patients = pd.read_csv("C:/tmp/patients.csv", header=0)

# Backup patients, just in case we need it later
# (.copy() makes an independent copy rather than a second name for the same object)
patientsBackup = patients.copy()

# First split matrix into y (dependent) and x (independent)
# Remember, Python is 0-offset! The "3rd" entry is at position 2.
# patientsY = Systolic
# patientsX = Everything else excluding LastName and Systolic

# In[61]:

# Split dependent variable and independent variables
patientsY = patients["Systolic"]
patientsX = patients[["Age", "Gender", "Height", "Location", "SelfAssessedHealthStatus", "Weight"]]
patientsXNumeric = patients[["Age", "Height", "Weight"]]

# Smoker is not pulled with the other categorical data, so this next code line
# was added
patientsXBinary = patients[["Smoker"]]

# In[62]:

# Standardize
patientsXNumeric = patients[["Age", "Height", "Weight"]]
patientsXNumeric_scaled = preprocessing.scale(patientsXNumeric, axis=0)

# Standardizing returns a NumPy array, so the column headers are lost. I need to fix this.

# In[63]:

# Return only object datatypes (non-numeric here)
categoriesX = patientsX.select_dtypes(include=[object]).copy()
categoriesX.head()

# One-Hot Encoding
# As said in this terrific one-hot tutorial:
# > There are many libraries out there that support one-hot encoding but the
# > simplest one is using pandas' .get_dummies() method.
# > There are mainly three arguments important here, the first one is the
# > DataFrame you want to encode on, second being the columns argument which lets
# > you specify the columns you want to do encoding on, and third, the prefix
# > argument which lets you specify the prefix for the new columns that will be
# > created after encoding.
# LastName is not to be included in the linear regression.

# In[34]:

categoriesX_onehot = categoriesX.copy()
categoriesX_onehot = pd.get_dummies(categoriesX, columns=["Gender", "Location", "SelfAssessedHealthStatus"], prefix=["Gender", "Location", "SelfAssessedHealthStatus"])
categoriesXBinary_onehot = pd.get_dummies(patientsXBinary, columns=["Smoker"], prefix=["Smoker"])

# Return results
print(categoriesX_onehot.head())
print(categoriesXBinary_onehot.head())

# Now that one-hot encoding has split the categorical attributes into many
# dummy attributes, they must be concatenated back together. This can be done via
# pandas' .concat() method. The axis argument is set to 1 as you want to merge
# on columns.

# In[35]:

print("categoriesX_onehot is:", type(categoriesX_onehot))
print(categoriesX_onehot.shape)
print("categoriesXBinary_onehot is:", type(categoriesXBinary_onehot))
print(categoriesXBinary_onehot.shape)
print("patientsXNumeric_scaled is:", type(patientsXNumeric_scaled))
print(patientsXNumeric_scaled.shape)

# patientsXNumeric_scaled is an array. I used
# https://stackoverflow.com/questions/20763012/creating-a-pandas-dataframe-from-a-numpy-array-how-do-i-specify-the-index-colum
# to convert it to a dataframe.

# In[36]:

patientsXNumeric_scaleddf = pd.DataFrame(patientsXNumeric_scaled)

# In[37]:

print("patientsXNumeric_scaleddf is:", type(patientsXNumeric_scaleddf))

# In[38]:

print(patientsXNumeric_scaleddf.head())

# Looks better, but it still needs column names.

# In[39]:

patientsXNumeric_scaleddf.columns = ["Age", "Height", "Weight"]

# Now we bring all the columns back together as one dataframe.

# In[40]:

# Bring them back together
# Unscaled variant, kept for reference; the scaled version below is the one used
# patientsXAll = pd.concat([categoriesX_onehot, patientsXNumeric, categoriesXBinary_onehot], axis=1)
patientsXAll = pd.concat([categoriesX_onehot, patientsXNumeric_scaleddf, categoriesXBinary_onehot], axis=1)

# top 3 rows
print(patientsXAll.head(3))

# Fit a lasso model and report its 10-fold cross-validation scores
lasso = linear_model.Lasso()
print("Cross Val Score:" + str(cross_val_score(lasso, patientsXAll, patientsY, cv=10)))

Markdown

To view this entire document’s markdown code, click here.

Data

If you don’t have the dataset, copy the table below, and after pasting it into Excel, save it as a comma-separated file named patients.csv in your preferred directory.

Age	Diastolic	Gender	Height	LastName	Location	SelfAssessedHealthStatus	Smoker	Systolic	Weight
38	93	‘Male’	71	‘Smith’	‘County General Hospital’	‘Excellent’	1	124	176
43	77	‘Male’	69	‘Johnson’	‘VA Hospital’	‘Fair’	0	109	163
38	83	‘Female’	64	‘Williams’	‘St. Mary’s Medical Center’	‘Good’	0	125	131
40	75	‘Female’	67	‘Jones’	‘VA Hospital’	‘Fair’	0	117	133
49	80	‘Female’	64	‘Brown’	‘County General Hospital’	‘Good’	0	122	119
46	70	‘Female’	68	‘Davis’	‘St. Mary’s Medical Center’	‘Good’	0	121	142
33	88	‘Female’	64	‘Miller’	‘VA Hospital’	‘Good’	1	130	142
40	82	‘Male’	68	‘Wilson’	‘VA Hospital’	‘Good’	0	115	180
28	78	‘Male’	68	‘Moore’	‘St. Mary’s Medical Center’	‘Excellent’	0	115	183
31	86	‘Female’	66	‘Taylor’	‘County General Hospital’	‘Excellent’	0	118	132
45	77	‘Female’	68	‘Anderson’	‘County General Hospital’	‘Excellent’	0	114	128
42	68	‘Female’	66	‘Thomas’	‘St. Mary’s Medical Center’	‘Poor’	0	115	137
25	74	‘Male’	71	‘Jackson’	‘VA Hospital’	‘Poor’	0	127	174
39	95	‘Male’	72	‘White’	‘VA Hospital’	‘Excellent’	1	130	202
36	79	‘Female’	65	‘Harris’	‘St. Mary’s Medical Center’	‘Good’	0	114	129
48	92	‘Male’	71	‘Martin’	‘VA Hospital’	‘Good’	1	130	181
32	95	‘Male’	69	‘Thompson’	‘St. Mary’s Medical Center’	‘Excellent’	1	124	191
27	79	‘Female’	69	‘Garcia’	‘VA Hospital’	‘Fair’	1	123	131
37	77	‘Male’	70	‘Martinez’	‘County General Hospital’	‘Good’	0	119	179
50	76	‘Male’	68	‘Robinson’	‘County General Hospital’	‘Good’	0	125	172
48	75	‘Female’	65	‘Clark’	‘VA Hospital’	‘Excellent’	0	121	133
39	79	‘Female’	64	‘Rodriguez’	‘VA Hospital’	‘Fair’	0	123	117
41	88	‘Female’	62	‘Lewis’	‘VA Hospital’	‘Fair’	0	114	137
44	90	‘Female’	66	‘Lee’	‘County General Hospital’	‘Fair’	1	128	146
28	96	‘Female’	65	‘Walker’	‘County General Hospital’	‘Good’	1	129	123
25	77	‘Male’	70	‘Hall’	‘VA Hospital’	‘Poor’	0	114	189
39	80	‘Female’	63	‘Allen’	‘VA Hospital’	‘Excellent’	0	113	143
25	76	‘Female’	63	‘Young’	‘County General Hospital’	‘Good’	0	125	114
36	83	‘Male’	68	‘Hernandez’	‘County General Hospital’	‘Poor’	0	120	166
30	89	‘Male’	67	‘King’	‘County General Hospital’	‘Excellent’	1	127	186
45	92	‘Female’	70	‘Wright’	‘VA Hospital’	‘Excellent’	1	134	126
40	83	‘Female’	66	‘Lopez’	‘VA Hospital’	‘Poor’	0	121	137
25	80	‘Female’	64	‘Hill’	‘St. Mary’s Medical Center’	‘Excellent’	0	115	138
47	84	‘Male’	70	‘Scott’	‘St. Mary’s Medical Center’	‘Excellent’	0	127	187
44	92	‘Male’	71	‘Green’	‘County General Hospital’	‘Good’	0	121	193
48	83	‘Female’	66	‘Adams’	‘VA Hospital’	‘Excellent’	0	127	137
44	90	‘Male’	71	‘Baker’	‘VA Hospital’	‘Good’	1	136	192
35	85	‘Female’	66	‘Gonzalez’	‘St. Mary’s Medical Center’	‘Fair’	0	117	118
33	90	‘Male’	66	‘Nelson’	‘St. Mary’s Medical Center’	‘Good’	1	124	180
38	74	‘Female’	63	‘Carter’	‘St. Mary’s Medical Center’	‘Good’	0	120	128
39	92	‘Male’	71	‘Mitchell’	‘County General Hospital’	‘Fair’	1	128	164
44	80	‘Male’	69	‘Perez’	‘VA Hospital’	‘Excellent’	0	116	183
44	89	‘Male’	70	‘Roberts’	‘VA Hospital’	‘Good’	1	132	169
37	96	‘Male’	70	‘Turner’	‘VA Hospital’	‘Excellent’	1	137	194
45	89	‘Male’	67	‘Phillips’	‘VA Hospital’	‘Good’	0	117	172
37	77	‘Female’	65	‘Campbell’	‘County General Hospital’	‘Fair’	0	116	135
30	81	‘Male’	68	‘Parker’	‘VA Hospital’	‘Poor’	0	119	182
39	76	‘Female’	62	‘Evans’	‘County General Hospital’	‘Good’	0	123	121
42	83	‘Male’	70	‘Edwards’	‘County General Hospital’	‘Excellent’	0	116	158
42	78	‘Male’	67	‘Collins’	‘County General Hospital’	‘Good’	1	124	179
49	95	‘Male’	68	‘Stewart’	‘County General Hospital’	‘Poor’	1	129	170
44	91	‘Female’	62	‘Sanchez’	‘St. Mary’s Medical Center’	‘Good’	1	130	136
43	91	‘Female’	64	‘Morris’	‘County General Hospital’	‘Poor’	1	132	135
47	86	‘Female’	66	‘Rogers’	‘VA Hospital’	‘Excellent’	0	117	147
50	89	‘Male’	72	‘Reed’	‘VA Hospital’	‘Excellent’	1	129	186
38	79	‘Female’	63	‘Cook’	‘VA Hospital’	‘Excellent’	0	118	124
41	74	‘Female’	66	‘Morgan’	‘St. Mary’s Medical Center’	‘Good’	0	120	134
45	82	‘Male’	70	‘Bell’	‘St. Mary’s Medical Center’	‘Good’	1	138	170
36	76	‘Male’	71	‘Murphy’	‘VA Hospital’	‘Good’	0	117	180
38	81	‘Female’	68	‘Bailey’	‘St. Mary’s Medical Center’	‘Good’	0	113	130
29	77	‘Female’	63	‘Rivera’	‘County General Hospital’	‘Excellent’	0	122	130
28	73	‘Female’	65	‘Cooper’	‘VA Hospital’	‘Good’	0	115	127
30	85	‘Female’	67	‘Richardson’	‘County General Hospital’	‘Excellent’	0	120	141
28	76	‘Female’	66	‘Cox’	‘County General Hospital’	‘Good’	0	117	111
29	80	‘Female’	68	‘Howard’	‘VA Hospital’	‘Excellent’	0	123	134
36	80	‘Male’	71	‘Ward’	‘St. Mary’s Medical Center’	‘Good’	0	123	189
45	79	‘Female’	70	‘Torres’	‘County General Hospital’	‘Excellent’	0	119	137
32	82	‘Female’	60	‘Peterson’	‘County General Hospital’	‘Excellent’	0	110	136
31	79	‘Female’	64	‘Gray’	‘VA Hospital’	‘Excellent’	0	121	130
48	82	‘Female’	64	‘Ramirez’	‘County General Hospital’	‘Excellent’	1	138	137
25	75	‘Male’	66	‘James’	‘County General Hospital’	‘Good’	0	125	186
40	91	‘Female’	64	‘Watson’	‘VA Hospital’	‘Fair’	1	122	127
39	74	‘Male’	72	‘Brooks’	‘St. Mary’s Medical Center’	‘Excellent’	0	120	176
41	78	‘Female’	65	‘Kelly’	‘St. Mary’s Medical Center’	‘Poor’	0	117	127
33	85	‘Female’	67	‘Sanders’	‘St. Mary’s Medical Center’	‘Excellent’	1	125	115
31	84	‘Male’	72	‘Price’	‘VA Hospital’	‘Fair’	1	124	178
35	75	‘Female’	64	‘Bennett’	‘County General Hospital’	‘Fair’	0	121	131
32	78	‘Male’	68	‘Wood’	‘St. Mary’s Medical Center’	‘Poor’	0	118	183
42	81	‘Male’	66	‘Barnes’	‘County General Hospital’	‘Excellent’	0	120	194
48	79	‘Female’	64	‘Ross’	‘VA Hospital’	‘Good’	0	118	126
34	85	‘Male’	68	‘Henderson’	‘St. Mary’s Medical Center’	‘Good’	0	118	186
39	79	‘Male’	69	‘Coleman’	‘VA Hospital’	‘Excellent’	0	122	188
28	82	‘Male’	69	‘Jenkins’	‘County General Hospital’	‘Good’	1	134	189
29	80	‘Female’	64	‘Perry’	‘St. Mary’s Medical Center’	‘Good’	0	131	120
32	80	‘Female’	63	‘Powell’	‘VA Hospital’	‘Excellent’	0	113	132
39	92	‘Male’	68	‘Long’	‘County General Hospital’	‘Good’	1	125	182
37	92	‘Female’	65	‘Patterson’	‘County General Hospital’	‘Poor’	1	135	120
49	96	‘Female’	63	‘Hughes’	‘County General Hospital’	‘Good’	1	128	123
31	87	‘Female’	66	‘Flores’	‘VA Hospital’	‘Good’	1	123	141
37	81	‘Female’	65	‘Washington’	‘St. Mary’s Medical Center’	‘Good’	0	122	129
38	90	‘Male’	68	‘Butler’	‘County General Hospital’	‘Excellent’	1	138	184
45	77	‘Male’	71	‘Simmons’	‘VA Hospital’	‘Excellent’	0	124	181
30	91	‘Female’	70	‘Foster’	‘St. Mary’s Medical Center’	‘Fair’	0	130	124
48	79	‘Male’	71	‘Gonzales’	‘County General Hospital’	‘Good’	0	123	174
48	73	‘Female’	66	‘Bryant’	‘County General Hospital’	‘Excellent’	0	129	134
25	99	‘Male’	69	‘Alexander’	‘County General Hospital’	‘Good’	1	128	171
44	92	‘Male’	69	‘Russell’	‘VA Hospital’	‘Good’	1	124	188
49	74	‘Male’	70	‘Griffin’	‘County General Hospital’	‘Fair’	0	119	186
45	93	‘Male’	68	‘Diaz’	‘County General Hospital’	‘Good’	1	136	172
48	86	‘Male’	66	‘Hayes’	‘County General Hospital’	‘Fair’	0	114	177

Publications

Videos

I’ve recorded a 45 minute video on how to bring machine learning to the next level in an applied Wine Quality Prediction Project.

If you’re not ready for that and want a tutorial on the basics of Machine Learning, my 1.5 hour Overview of Machine Learning might be better. It will guide you through many of the general concepts, as well as some of the various models listed below.

Contact

Email: mortensengarth@hotmail.com

LinkedIn: https://www.linkedin.com/in/mortensengarth/.

Lasso Regression: Predicting Systolic

Garth Mortensen

September 2, 2018